In deep Q-learning, both $Q(s, a)$ and $Q'(s, a)$ are predicted (estimated) by the neural network itself. In supervised learning, the target value is a true, unbiased value that does not depend on the model. However, this isn't the case in reinforcement learning, where the target is itself an estimate. So, how can we be sure that deep Q-learning converges? How do we know that the target Q-values are accurate?
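To make the concern concrete, here is a minimal sketch of how the target is usually built, assuming the standard one-step bootstrapped target $r + \gamma \max_{a'} Q'(s', a')$ and reading $Q'$ as a periodically synced copy of the network. The linear Q-function, the weights, and the names `q_values` and `td_target` are hypothetical and only for illustration; a real DQN would use a neural network.

```python
import numpy as np

def q_values(weights, state):
    """Hypothetical linear Q-function: one row of weights per action."""
    return weights @ state

def td_target(reward, next_state, done, target_weights, gamma=0.99):
    """One-step bootstrapped target: r + gamma * max_a' Q'(s', a')."""
    if done:
        # Terminal transition: no future value to bootstrap from.
        return reward
    return reward + gamma * np.max(q_values(target_weights, next_state))

# Toy weights (assumed values): 2 actions, 2 state features.
online_weights = np.array([[0.5, -0.2],
                           [0.1, 0.3]])
# Q' is a copy of the online weights, held fixed between syncs.
target_weights = online_weights.copy()

state = np.array([1.0, 0.0])
next_state = np.array([0.0, 1.0])
action, reward = 0, 1.0

prediction = q_values(online_weights, state)[action]           # Q(s, a)
target = td_target(reward, next_state, False, target_weights)  # uses Q'
td_error = target - prediction
```

Note that `target` depends on `target_weights`, i.e. on the model's own estimate, rather than on a label supplied from outside. That is exactly the difference from supervised learning that the question is about.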
See https://ai.stackexchange.com/q/21053/2444 and https://ai.stackexchange.com/q/11679/2444. – nbro Aug 03 '20 at 14:45