
I have two sets of data: a training set and a test set. I use the training set to train a deep Q-network (DQN) variant, and I continuously evaluate the agent's Q-values on the test set every 5000 epochs. I find that neither the agent's Q-values on the test set nor the resulting policies converge.

iteration $x$: Q-values for the first 5 test samples are [15.271439, 13.013742, 14.137051, 13.96463, 11.490129], with policies (greedy actions): [15, 0, 0, 0, 15]

iteration $x+10000$: Q-values for the first 5 test samples are [15.047309, 15.5233555, 16.786497, 16.100864, 13.066223], with policies (greedy actions): [0, 0, 0, 0, 15]

This suggests that the weights of the neural network are not converging. Although I could manually test each policy at each iteration and decide which one performs best, I would like to know: should correct training of the network lead to weight convergence?
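For reference, here is a minimal sketch of the kind of periodic evaluation I am describing (simplified; it assumes a PyTorch `q_net` and a tensor of held-out `test_states`, not my exact code):

```python
# Simplified sketch: evaluate a trained Q-network on held-out states and
# track how much the greedy policy changes between checkpoints.
# `q_net` (an nn.Module mapping states to per-action Q-values) and
# `test_states` (a float tensor of shape [batch, state_dim]) are assumed.
import torch

def evaluate_policy(q_net, test_states):
    q_net.eval()
    with torch.no_grad():
        q_values = q_net(test_states)               # shape: (batch, n_actions)
        best_q, best_actions = q_values.max(dim=1)  # greedy policy
    return best_actions.cpu().numpy(), best_q.cpu().numpy()

# At iteration x:       actions_a, q_a = evaluate_policy(q_net, test_states)
# ... train further ...
# At iteration x + k:   actions_b, q_b = evaluate_policy(q_net, test_states)
# policy_agreement = (actions_a == actions_b).mean()  # fraction of unchanged actions
```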

Training loss plot: [image: training loss vs. iterations]

You can see that the loss decreases over time; however, there are occasional spikes in the loss that do not seem to go away.
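From the discussion in the comments below, one common source of such spikes is the periodic hard update of the target network, which shifts the TD targets abruptly. For concreteness, a minimal sketch contrasting hard and soft target updates; `q_net`, `target_net`, and the update constants are hypothetical:

```python
import torch

TARGET_UPDATE_EVERY = 1000   # hypothetical hard-update period (training steps)
TAU = 0.005                  # hypothetical soft-update rate

def hard_update(q_net, target_net, step):
    # Copying the online weights every C steps shifts the TD targets abruptly,
    # which often shows up as periodic spikes in the training loss.
    if step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())

def soft_update(q_net, target_net, tau=TAU):
    # Polyak averaging: target <- tau * online + (1 - tau) * target.
    # Targets drift slowly, which usually gives a smoother loss curve.
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), q_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```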

calveeen
  • If you already have the data, why are you using RL? Would this not be a supervised/unsupervised problem? – David Jun 09 '20 at 09:10
  • It is not a supervised learning problem. The training set contains experiences and can be seen as the "replay buffer", except that no new experiences can be collected. Since I cannot obtain the "rewards" for the current policy in real time, I am evaluating the trained DQN on the "replay buffer" and analysing the Q-values that the neural network outputs on unseen states (test data) every few epochs. – calveeen Jun 09 '20 at 09:49
  • How do you generate the experience? Won't the experience you store be from a really poor policy, since you're not optimising at the same time? It sounds like you would have to gather experience, train the DQN a bit to improve the policy, then get more experience; otherwise you will just keep collecting experience from a poor policy, and the DQN could never get close to the optimal solution. – David Jun 09 '20 at 10:10
  • Because I am trying to apply reinforcement learning techniques to clinical data, it is not possible for me to gather new experiences; I have to work with the experiences already demonstrated by the clinicians. – calveeen Jun 09 '20 at 10:12
  • What type of clinical data is it? Are you trying to figure out the best treatment group? – David Jun 09 '20 at 10:22
  • I’m trying to figure out the best treatment policies for sepsis patients and patient states are continuous variables, hence the DQN is needed. The actions (drug doses) are discretised into different bins. – calveeen Jun 09 '20 at 10:26
  • What does the loss look like when you are training the DQN? Is it decreasing over time? – user5093249 Jun 09 '20 at 12:11
  • @user5093249 The loss decreases over time, but there are still occasional spikes that are difficult to get rid of despite training for longer periods of time. The deep Q-network uses prioritised experience replay together with some regularisation. I have linked the paper that I adapted the model from here: https://arxiv.org/abs/1711.09602. Are the Q-values being learnt well if they do not converge? – calveeen Jun 09 '20 at 12:55
  • Thanks @calveeen. I also came across that paper while trying to understand how DQN fits in your use case. Regarding the plot, the fact that it has a general decreasing trend seems good. The spikes may be due to the instants where you update the target network (or other reasons; see [this](https://stats.stackexchange.com/questions/303857/explanation-of-spikes-in-training-loss-vs-iterations-with-adam-optimizer) or [this](https://stackoverflow.com/questions/47824598/why-does-my-training-loss-have-regular-spikes)). – user5093249 Jun 09 '20 at 14:48
  • I would expect the Q-values of the agent's chosen action to converge on the test data, though, as this would indicate that the network weights have converged. However, the policies obtained do not seem to resemble each other (I run an evaluation on the test set every 10,000 epochs). This makes it very difficult to know whether the optimal policy has truly been found. – calveeen Jun 09 '20 at 14:59
  • It is not very clear to me how you use your test data to evaluate the DQN. Doesn't the test data already contain the actions taken by the physician? Do you replace those with the actions returned by the trained DQN? Regarding "This makes it very difficult to know whether the optimal policy has truly been found", I agree with the answer [here](https://ai.stackexchange.com/questions/21182/how-to-evaluate-a-deep-q-network) that it would be better to evaluate the training performance of your DQN based on the average reward it is getting, rather than relying on the Q-values (a rough sketch of such a check follows after this thread). – user5093249 Jun 09 '20 at 15:15
  • Yes, the test data already contains the actions taken by physicians. What I am doing is feeding the states in the test set into the trained DQN; I then analyse the agent's Q-values and actions (based on the greedy policy), and I notice that the Q-values of the argmax action do not converge, and likewise the agent's actions do not seem to approach an optimal policy. I provided a sample of what the Q-values look like when this evaluation was carried out at different time steps. – calveeen Jun 09 '20 at 15:20
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/109103/discussion-between-user5093249-and-calveeen). – user5093249 Jun 09 '20 at 15:37
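Following the suggestion above to judge the DQN by reward rather than by raw Q-values, here is a rough sketch of a simple sanity check on the logged test transitions. It is only a crude proxy, not a proper off-policy evaluation, and the names `test_states`, `logged_actions`, and `logged_rewards` are hypothetical:

```python
import torch

def greedy_agreement_report(q_net, test_states, logged_actions, logged_rewards):
    """Crude sanity check on logged data: how often does the greedy policy match
    the clinicians' logged actions, and what average reward did those matched
    transitions receive? (Not a substitute for proper off-policy evaluation.)"""
    q_net.eval()
    with torch.no_grad():
        greedy_actions = q_net(test_states).argmax(dim=1)
    match = greedy_actions == logged_actions
    agreement = match.float().mean().item()
    avg_reward = logged_rewards[match].mean().item() if match.any() else float("nan")
    return {"greedy_agreement": agreement, "avg_reward_when_matched": avg_reward}
```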

0 Answers