Following the DQN algorithm with experience replay:
Store transition $\left(\phi_{t}, a_{t}, r_{t}, \phi_{t+1}\right)$ in $D$.
Sample a random minibatch of transitions $\left(\phi_{j}, a_{j}, r_{j}, \phi_{j+1}\right)$ from $D$.
Set
$$y_{j}= \begin{cases} r_{j} & \text{if episode terminates at step } j+1 \\ r_{j}+\gamma \max_{a^{\prime}} \hat{Q}\left(\phi_{j+1}, a^{\prime} ; \theta^{-}\right) & \text{otherwise} \end{cases}$$
Perform a gradient descent step on $\left(y_{j}-Q\left(\phi_{j}, a_{j} ; \theta\right)\right)^{2}$ with respect to the network parameters $\theta$.
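For concreteness, here is a minimal PyTorch sketch of this update step; the network objects, the minibatch layout, and the optimizer are my own assumptions for illustration, not part of the question:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on (y_j - Q(phi_j, a_j; theta))^2 for a sampled minibatch.

    `batch` is assumed to contain tensors: states, actions (long), rewards,
    next_states, and a float `dones` flag marking terminal transitions.
    """
    states, actions, rewards, next_states, dones = batch

    # Q(phi_j, a_j; theta) for the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y_j = r_j                                                 if terminal at step j+1
    #     = r_j + gamma * max_a' Q_hat(phi_{j+1}, a'; theta^-)  otherwise
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    # Gradient descent step on the squared TD error, w.r.t. theta only
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```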
We calculate the loss $\text{loss}=\big(Q(s,a)-(r+\gamma \max_{a'} Q(s',a'))\big)^2$, where $s'$ is the next state.
Assume the rewards are positive but vary over time, i.e., $r>0$ for every transition.
Since the rewards are positive, when I compute this loss I find that almost always $Q(s,a) < r+\gamma \max_{a'} Q(s',a')$, i.e., the target exceeds the current estimate.
Therefore, the network learns to keep increasing the $Q$-function, and the $Q$-values for the same states grow higher and higher as training progresses.
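As a worked illustration of this drift (the numbers are invented for the example): with $\gamma = 0.99$, a current estimate $Q(s,a) = \max_{a'} Q(s',a') = 5$, and $r = 1$, the target is

$$y = r + \gamma \max_{a'} Q(s',a') = 1 + 0.99 \times 5 = 5.95 > Q(s,a) = 5,$$

so the gradient step pushes $Q(s,a)$ upward; when $s'$ is later updated in the same way, its value rises too, and the targets keep climbing.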
How can I stabilize the learning process?