
I have a scheduling problem that I am trying to solve with RL (if you are interested in more details, you can read about it here: Reinforcement learning applicable to a scheduling problem?).

I have created my own environment (OpenAI Gym) and have trained the model for one specific day of the simulation. So I have 288 timesteps for one day (one for every 5 minutes), and the simulation lasts until the end of the day. The agent thus needs to make 288 decisions with one control variable.

Now my question is whether it is possible to successively train an RL agent on the same environment for different days. The environment and reward function stay the same, but the input data changes, as every day has different input data (temperature, heat demand, electricity price, etc.). So I would like to train an agent on one day and then tell it to train on another day without forgetting everything it has learned during training on the first day. This way I can make sure that the agent does not overfit to one particular set of input data but is able to generalize and is therefore applicable to different days.

Do you know if and how I can do this?

Reminder: Can anybody tell me more about this? I'd highly appreciate any further comments, as I still don't know how to do this.

PeterBe

1 Answer


You can mitigate catastrophic forgetting by storing the trajectories generated by the actors during training in a replay buffer. Then, you sample trajectories from that replay buffer. This way, each mini-batch of experience will contain data from multiple days.
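
A minimal, framework-agnostic sketch of such a buffer (the class and method names are illustrative, not from any particular library): transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ from every day are appended to one store, and mini-batches are drawn from it at random.

```python
import random

class ReplayBuffer:
    """Stores transitions (s_t, a_t, r_{t+1}, s_{t+1}, done) from every day seen so far."""

    def __init__(self):
        self.transitions = []

    def add(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform sampling: each mini-batch mixes experience from all stored days.
        return random.sample(self.transitions, batch_size)
```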

There are many strategies to do this sampling, but you can start with uniform sampling. From what you're describing, it doesn't seem that storage is going to be an issue (288 data points per day is small), so you can keep all trajectories. If you can't afford to store all trajectories, then you should also design a strategy to remove them from the replay buffer.
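
If memory ever does become a concern, the simplest removal strategy is first-in-first-out, which you get for free from a bounded deque (a sketch; the capacity is arbitrary):

```python
from collections import deque

# Once `maxlen` transitions are stored, appending a new one
# silently evicts the oldest: first-in-first-out removal.
transitions = deque(maxlen=100_000)
```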

You can refer to this handy guide describing how to implement a replay buffer in TensorFlow.

  • Thanks Raphael for your answer. I have to admit that I have problems understanding it. You mentioned catastrophic forgetting. Actually, I don't want to forget the information from the previous days. And what do you mean by sampling trajectories from a replay buffer? And just for the record: I am not using TensorFlow for reinforcement learning, so the posted link is not suitable for me (although I appreciate that you have posted it). – PeterBe May 23 '22 at 07:29
  • Thanks Raphael for your answer. Any comments on my last comment? I'd highly appreciate any further comment from you. – PeterBe May 30 '22 at 08:26
  • Any further comments? I have problems understanding your answer. Would you mind elaborating a little bit more on it (see my first comment)? – PeterBe Jun 07 '22 at 09:06
  • Apologies, I was on holiday :) The method I'm describing is supposed to prevent catastrophic forgetting. The idea is to store the 288 decisions the agent has made on day 1 (with the associated states and rewards). That's a trajectory. Then the next day, you train on a mixture of the decisions made on this day and those of day 1. On day 3 you train on a mixture of decisions made on this day and those of days 1 and 2, and so on. Is that clearer? – Raphael Lopez Kaufman Jun 09 '22 at 18:14
  • Thanks for your comment Raphael. Actually I still don't understand it. Shall I train an agent and then test it on day 1 and store the values for each timeslot? But how should this work? After training the model, I can just apply it to other days. How can I tell the model to further learn on another day and still somehow consider the values from the training of day 1? – PeterBe Jun 10 '22 at 11:51
  • I don't know what ML framework you use, but it's fairly standard in most to continue training the same model architecture by restoring the weights from a previous training run as the starting point for a new training run. So, on day two, reload the weights from the training done on day one, and train the model on a mixture of day one and day two data. On day three, reload the weights from the training done on day two and train the model on a mixture of day one, two and three data, and so on. Does it make sense? – Raphael Lopez Kaufman Jun 13 '22 at 17:41
  • Thanks for your answer and effort. I really appreciate it. I use the framework Stable-Baselines3 (https://stable-baselines3.readthedocs.io/en/master/) and currently the algorithm A2C. So I have to find a way to continue training the model after having loaded weights from a previous run. What I don't understand in your comment is how to "train the model on a mixture of day one and day two data". My environment is a simulation of different days. One run is for one day. So it does not make sense to "mix" the days, as the data would be inconsistent and unrealistic if you created a new day. – PeterBe Jun 14 '22 at 08:38
  • A2C is not a good fit for experience replay because it's an on-policy algorithm (there are off-policy variants, though). You could switch to something like DQN or offline RL, which are better suited to replaying experience. In DQN, it's fine to mix data from several days because you learn on transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ rather than whole episodes. PS: don't forget to upvote my comments if you find them useful! – Raphael Lopez Kaufman Jun 14 '22 at 18:48
  • Thanks for your comment. The reason why I don't use DQN is that I have a continuous 3-dimensional action space (meaning I have 3 continuous variables). To be totally honest, I still have fundamental problems with your suggested approach. Mixing the days is not reasonable, as I also get a cumulative reward at the end of the episode (in addition to the reward after every action). – PeterBe Jun 15 '22 at 08:29
  • Further, I don't understand how your trajectory approach should work. You said I should "store the 288 decisions the agent has made on day 1". Are you talking about the weights of the model or why else should I store the decisions of the agent after training? Your described pipeline is not clear to me. When I train the model on day 1, then I could store the weights of the model. Shall I then run the agent again on day 1 with the stored weights of the model and test the agent on the same day and store which decisions it is making? If so, what shall I do with the stored actions? – PeterBe Jun 15 '22 at 08:31
  • The terminal reward is not a problem, it's part of the final transition. You train your agent on batches of transitions, not on whole trajectories. That's what you're mixing: in your batches, you'll have transitions from day 1, day 2, etc. When I say "store the 288 decisions the agent has made on day 1", I mean the 288 transitions $(s_t, a_t, r_{t+1}, s_{t+1})$. At the end of training on these 288 transitions, you should separately store the weights of the model. – Raphael Lopez Kaufman Jun 15 '22 at 19:23
  • Thanks Raphael for your comment and effort. I really appreciate it. Unfortunately I have to admit that I don't understand your answer and still don't know what to do. In my current setting, I train the agent only for one day and afterwards I can test it. Now I don't understand what to do. One training episode always ends after one day. I have to find a way to tell the training algorithm to switch the days during the training after one episode. How can I do that? I think this should be independent of your trajectory and replay buffer approach. – PeterBe Jun 20 '22 at 15:27
  • Thanks Raphael for your answer. Any comments on my last comment? I'd highly appreciate any further comment from you. – PeterBe Jun 24 '22 at 07:50
  • I'm not sure how I can be of more help unfortunately. Hopefully someone else understands your problem better than I do – Raphael Lopez Kaufman Jul 07 '22 at 18:25
  • @Raphael: Thanks for your comment. Would you mind giving a little bit more information about how to do the things you suggest? How can I generate the trajectories, and how can I train on a mixture of days 1 and 2 as you describe? For example, you wrote "So, on day two, reload the weights from the training done on day one, and train the model on a mixture of day one and day two data." --> How can I mix those 2 days? Maybe you can describe yourself how to successively train an RL agent on the same environment with different data, or maybe you know a link where this is described. – PeterBe Jul 09 '22 at 15:57
  • Thanks for your comments. Any comments on my last comment? I'd highly appreciate any further comment from you. – PeterBe Jul 18 '22 at 06:46
  • @Raphael Lopez Kaufman: Thanks Raphael for your comments so far. Would you mind elaborating a little bit more on my previous comment? There I describe what I don't understand about your given answers so far. – PeterBe Jul 22 '22 at 09:21
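
To make the discussion in the comments more concrete: one way to "switch the days during the training after one episode" is to let the environment itself load a different day's input data in `reset()`, while the reward function stays the same. The sketch below is only an illustration, not the asker's actual environment: `day_datasets`, the observation size, and the placeholder `_observation`/`_reward` methods are assumptions, and the classic Gym API (`reset()` returning an observation, `step()` returning `obs, reward, done, info`) is used.

```python
import random

import gym
import numpy as np

class SchedulingEnv(gym.Env):
    """Hypothetical daily scheduling environment: one episode = one day = 288 five-minute steps."""

    def __init__(self, day_datasets):
        # day_datasets: list of per-day input data (temperature, heat demand, electricity price, ...)
        self.day_datasets = day_datasets
        # Three continuous control variables, as mentioned in the comments; adjust to your setup.
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(8,), dtype=np.float32)
        self.day = None
        self.t = 0

    def reset(self):
        # Pick a (different) day's input data for this episode, so successive
        # episodes see different days without changing the reward function.
        self.day = random.choice(self.day_datasets)
        self.t = 0
        return self._observation()

    def step(self, action):
        reward = self._reward(action)
        self.t += 1
        done = self.t >= 288  # the episode ends at the end of the day
        return self._observation(), reward, done, {}

    def _observation(self):
        # Placeholder: build the observation from self.day (inputs) and self.t (time of day).
        return np.zeros(8, dtype=np.float32)

    def _reward(self, action):
        # Placeholder: the real reward function is the same for every day.
        return 0.0
```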
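
And for the point about restoring weights between runs: Stable-Baselines3 can save a model, load it later with a (new) environment attached, and keep training, so day 2 starts from the weights learned on day 1. A minimal sketch, assuming the `SchedulingEnv` above and hypothetical `day1_data`/`day2_data` objects; the timestep counts are arbitrary.

```python
from stable_baselines3 import A2C

# Day 1: train from scratch on an environment that only serves day-1 data.
env_day1 = SchedulingEnv(day_datasets=[day1_data])  # day1_data is assumed to exist
model = A2C("MlpPolicy", env_day1, verbose=1)
model.learn(total_timesteps=288 * 50)  # e.g. 50 episodes of day 1
model.save("a2c_scheduling")

# Day 2: reload the day-1 weights and continue training on a mixture of days 1 and 2
# (the environment samples one of the two days at every reset()).
env_day2 = SchedulingEnv(day_datasets=[day1_data, day2_data])
model = A2C.load("a2c_scheduling", env=env_day2)
model.learn(total_timesteps=288 * 100, reset_num_timesteps=False)
model.save("a2c_scheduling")
```

Note that, as pointed out in the comments, A2C is on-policy, so the "mixing" here comes from the environment serving both days' data rather than from replaying stored day-1 transitions; the replay-buffer approach described in the answer requires an off-policy algorithm.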