I have a problem I would like to tackle with RL, but I am not sure if it is even doable.

My agent has to figure out how to fill a very large vector of natural numbers (from 600 up to 4000 entries in the most complex setting), e.g. a 600-dimensional vector $[2000, 3000, 3500, \dots]$ representing an energy profile for each timestep of the day, for each house in the neighborhood. Each possible combination yields a reward, and my goal is, of course, to maximize that reward.

I always start from the same initial state, and I receive a reward every time a profile is chosen. I believe these two factors simplify the task: I don't need long episodes to get a reward, nor do I have to take different states into consideration.
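
To make the setup concrete, here is a minimal sketch of how I am framing it as a one-step environment (the class name, vector length, and dummy reward are just placeholders for my actual simulator):

```python
import numpy as np

class ProfileEnv:
    """One-step environment: the whole action is the energy-profile vector."""

    def __init__(self, dim=600):
        self.dim = dim  # 600 in the simplest setting, up to 4000 in the most complex one

    def reset(self):
        # The initial state is always the same, so a constant dummy state is enough.
        return np.zeros(1, dtype=np.float32)

    def step(self, profile):
        # `profile` is the full vector of natural numbers, e.g. [2000, 3000, 3500, ...].
        assert profile.shape == (self.dim,)
        reward = self._simulate(profile)  # placeholder for my real reward computation
        done = True                       # every episode is a single step
        return self.reset(), reward, done, {}

    def _simulate(self, profile):
        # Dummy reward just so the sketch runs; the actual simulator goes here.
        return -np.abs(profile - 2500.0).mean()
```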

However, I only have experience with DQN and have never worked with policy gradient (PG) methods, so I have some questions:

  1. I would like to use the simplest method to implement, and I considered DDPG. However, I do not really need a target network or a critic network, as the state is always the same. Should I use vanilla PG? Would REINFORCE be a good option?

  2. I understand how PG methods work with a discrete action space: the network outputs a softmax over actions, one action is sampled, and its probability is then reinforced or discouraged based on the reward. However, I don't get how it is possible to update a continuous value. In DQN or stochastic PG, the output of the neural network is either a Q-value or a probability, and both can be pushed directly by the reward (the higher the reward, the larger the Q-value/probability). I don't see how this works in the continuous case, where I have to use the output of the model as it is. What would I have to change in the loss function of my model? (I put a rough sketch of both cases right below.)
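
To make question 2 concrete, here is roughly what I have in mind, in PyTorch: a REINFORCE-style loss for the discrete case as I understand it, and my guess at a Gaussian-policy version for the continuous case (the network names and the two-headed output are made up for illustration, not working code from my project):

```python
import torch
from torch.distributions import Categorical, Normal

# Discrete case (what I think I understand): the network outputs logits,
# an action is sampled from the softmax, and the loss scales the
# log-probability of that action by the reward.
def discrete_loss(policy_net, state, reward):
    logits = policy_net(state)               # shape: (n_actions,)
    dist = Categorical(logits=logits)
    action = dist.sample()
    return -dist.log_prob(action) * reward   # minimize -log pi(a|s) * R

# Continuous case (my guess): the network outputs the mean and log-std of a
# Gaussian, the whole profile is sampled from it, and the loss again uses the
# log-probability of the sampled action instead of the raw network output.
def continuous_loss(policy_net, state, reward):
    mean, log_std = policy_net(state)        # each of shape: (action_dim,)
    dist = Normal(mean, log_std.exp())
    action = dist.sample()                   # the 600-dim profile (before rounding to naturals)
    return -dist.log_prob(action).sum() * reward
```

Is this the right idea for the continuous case, i.e. is the only change that the log-probability now comes from a density instead of a softmax?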
