I am currently studying reinforcement learning, especially DQN. In DQN, learning proceeds by minimizing (roughly speaking) a norm of the difference between the optimal Bellman backup and the approximate Q-function, measured with a least-squares or Huber loss: $$ \min\|B^*Q^*-\hat{Q}\|. $$ Here $\hat{Q}$ is an estimator of the Q function, $Q^*$ is the optimal Q function, and $B^*$ is the optimal Bellman operator, $$ B^*Q^*(s,a)=\sum_{s'}p_T(s'|s,a)\left[r(s,a,s')+\gamma \max_{a'}Q^*(s',a')\right], $$ where $p_T$ is the transition probability, $r$ is the immediate reward, and $\gamma$ is the discount factor. As I understand it, in the DQN algorithm the expectation in the optimal Bellman backup is approximated by a single sampled transition, and the optimal Q function $Q^*$ is further approximated by an estimator different from $\hat{Q}$, say $\tilde{Q}$: \begin{equation}\label{question} B^*Q^*(s,a)\approx r(s,a,s')+\gamma\max_{a'}Q^*(s',a')\approx r(s,a,s')+\gamma\max_{a'}\tilde{Q}(s',a'),\tag{*} \end{equation} so the problem becomes $$ \min\|r(s,a,s')+\gamma\max_{a'}\tilde{Q}(s',a')-\hat{Q}(s,a)\|. $$
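For concreteness, here is how I understand the sampled target in code. This is only a minimal sketch of my reading of \eqref{question}, assuming PyTorch and hypothetical names `q_net` for $\hat{Q}$ and `target_net` for $\tilde{Q}$ (in DQN, $\tilde{Q}$ is usually the frozen target network):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Single-sample Bellman target, as I understand equation (*).

    The sum over s' in B*Q*(s,a) is replaced by the one next state s'
    actually observed in the transition, and Q* is replaced by the
    target network tilde-Q.
    """
    s, a, r, s_next, done = batch  # transitions sampled from a replay buffer
    with torch.no_grad():
        # max_{a'} tilde-Q(s', a') from the target network
        max_next_q = target_net(s_next).max(dim=1).values
        # y = r(s,a,s') + gamma * max_{a'} tilde-Q(s', a'); no bootstrap at terminal states
        y = r + gamma * (1.0 - done) * max_next_q
    # hat-Q(s, a) for the actions that were actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Huber loss between the sampled target y and the current estimate hat-Q(s, a)
    return F.smooth_l1_loss(q_sa, y)
```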
What I want to ask: I would like to know the mathematical or theoretical background of the approximation in \eqref{question}, especially why the first approximation is justified, since it looks like a very rough approximation. Can the right-hand side be regarded as an "approximate Bellman equation"? I have looked at various pieces of literature and online resources, but none of them give an exact derivation, so I would be very grateful if you could point me to a reference as well.