I have a problem I would like to tackle with RL, but I am not sure if it is even doable.
My agent has to figure out how to fill a very large vector of natural numbers (from about 600 to 4000 entries in the most complex setting), e.g. a 600-dimensional vector $[2000, 3000, 3500, \dots]$ representing an energy profile for each timestep of a day, for each house in the neighborhood. Each possible vector yields a reward, and my goal is of course to maximize that reward.
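To make the shape of the problem concrete, here is a minimal sketch (the sizes and `reward_fn` are placeholders, not my real numbers or code):

```python
import numpy as np

# Hypothetical sizes, just to illustrate the shape (not my real numbers).
n_houses = 25                        # houses in the neighborhood
n_timesteps = 24                     # timesteps per day
action_dim = n_houses * n_timesteps  # 600 here, up to ~4000 in the hardest setting

# One "action" is a full vector of natural-number energy values.
profile = np.random.randint(2000, 4000, size=action_dim)

# reward_fn stands for my black-box evaluator: it takes the whole
# profile vector and returns a single scalar reward.
# reward = reward_fn(profile)
```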
I can always start from the same initial state, and I receive a reward every time any profile is chosen. I believe these two factors simplify the task, since I don't need long episodes to get a reward, nor do I have to take different states into consideration.
However, I only have experience with DQN and have never worked with Policy Gradient methods, so I have some questions:
I would like to use the simplest method to implement. I considered DDPG, but I don't think I really need a target network or a critic network, since the state is always the same. Should I use vanilla PG instead? Would REINFORCE be a good option?
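For reference, this is roughly what I imagine a vanilla REINFORCE setup would look like in my single-state case if I discretized the energy levels; `reward_fn`, the sizes, and the level-to-energy mapping are just stand-ins, and I'm not sure this framing is even right:

```python
import torch
import torch.nn as nn

n_levels = 50      # discretized energy levels per entry (my assumption)
action_dim = 600   # entries in the profile vector

def reward_fn(profile):
    # Stand-in for my real black-box evaluator.
    return -((profile.float() - 2500.0) ** 2).mean() / 1e6

# Since the state never changes, the "policy" can just be a learnable
# table of logits: one categorical distribution per vector entry.
logits = nn.Parameter(torch.zeros(action_dim, n_levels))
optimizer = torch.optim.Adam([logits], lr=1e-2)

for step in range(1000):
    dist = torch.distributions.Categorical(logits=logits)
    levels = dist.sample()               # one level index per entry
    profile = levels * 100               # map level index -> energy value
    reward = reward_fn(profile)

    # REINFORCE: increase the log-probability of the sampled action,
    # weighted by the reward (a baseline would reduce the variance).
    loss = -(dist.log_prob(levels).sum() * reward)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```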
I understand how PG methods work with a discrete action space: use a softmax, sample one action, and then reinforce or discourage it based on the reward. However, I don't get how a continuous action can be updated. In DQN, or in a stochastic PG with discrete actions, the output of the neural network is either a Q-value or a probability, and both can be pushed directly by the reward (the more reward, the larger the Q-value/probability). In the continuous case, though, the output of the model is the action itself, and I don't see how it gets updated. What would the loss function look like in this case?
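My best guess for the continuous case is something like the sketch below (a Gaussian whose mean comes from the network, with the same REINFORCE-style loss), but I don't know if that's actually how it is supposed to work, which is what I am asking. Again, `reward_fn`, the network sizes, and the rounding are placeholders:

```python
import torch
import torch.nn as nn

action_dim = 600
state = torch.zeros(1, 1)   # dummy constant state, since it never changes

def reward_fn(profile):
    # Stand-in for my real black-box evaluator.
    return -((profile - 2500.0) ** 2).mean() / 1e6

# The network outputs the mean of a Gaussian over the continuous action;
# the log standard deviation is a separate learnable parameter.
mean_net = nn.Sequential(nn.Linear(1, 128), nn.ReLU(), nn.Linear(128, action_dim))
log_std = nn.Parameter(torch.zeros(action_dim))
optimizer = torch.optim.Adam(list(mean_net.parameters()) + [log_std], lr=1e-3)

for step in range(1000):
    mean = mean_net(state).squeeze(0)
    dist = torch.distributions.Normal(mean, log_std.exp())
    action = dist.sample()                # continuous action vector
    reward = reward_fn(action.round())    # round to natural numbers before evaluating

    # Same REINFORCE-style loss, but with the Gaussian log-density instead
    # of a softmax probability; is this the right idea?
    loss = -(dist.log_prob(action).sum() * reward)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```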