
I have a multi-agent environment where agents are trying to optimise the overall energy consumption of their group. Agents can exchange energy between themselves (the exchange actions are: request, deny request, grant). This energy is produced from renewable sources and stored in each agent's individual battery. The overall goal is to reduce the energy drawn from non-renewable sources.

All agents have been built using DQN. All (S, A) pairs are stored in a replay memory, from which samples are drawn when updating the weights.

The reward function is modelled as follows: if at the end of the episode the aggregate consumption of the agent group from non-renewable sources is less than in the previous episode, all agents are rewarded with +1; if not, they receive -1. An episode (iteration) consists of 100 timesteps, after which the reward is calculated. I update the weights after each episode.

The reward obtained at the end of the episode is used to calculate the error for ALL (S, A) pairs in the episode, i.e. I am assigning the same reward to every (S, A) pair in that episode.
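Roughly, the assignment looks like this (a minimal sketch with placeholder names, not my actual code):

```python
# Minimal sketch of the current reward assignment (placeholder names).
def episode_end_reward(current_total, previous_total):
    """+1 for every agent if the group's non-renewable consumption fell
    compared to the previous episode, otherwise -1."""
    return 1.0 if current_total < previous_total else -1.0

def label_transitions(episode_transitions, reward):
    """Apply the same terminal reward to every transition collected during
    the episode before pushing it into replay memory."""
    return [(s, a, reward, s_next, done)
            for (s, a, _, s_next, done) in episode_transitions]
```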

My problem is that the agents are unable to learn the optimal behaviour to reduce the overall energy consumption from non-renewable sources. The overall consumption of the group is oscillating, i.e. sometimes increasing and sometimes decreasing. Does it have to do with the reward function? Or with Q-learning, since the environment is dynamic?

amitection

1 Answer


Does it have to do with the reward function?

This seems likely to me. You have chosen a reward that is unusual in that it cross-links episodes. It is not really a reinforcement learning problem to optimise behaviour with respect to the results of the previous episode's behaviour in this way. That might be an option in an evolutionary fitness context, if you had competing teams facing the same environment in a tournament-style selection.

Reinforcement learning should really use as direct a measure of your goal as you can construct. In this case you want to minimise a scalar quantity, and that is an obvious candidate for a negative reward. So the reward should be the negative of the total non-renewable energy consumption. The maximum possible value for any single episode would then be zero, achieved only when no non-renewable energy is used at all.
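As a rough illustration only (a minimal sketch; `nonrenewable_used_at` is a placeholder for however you measure consumption in your environment):

```python
# Sketch: episode reward as the negative of total non-renewable consumption.
# `nonrenewable_used_at(t)` is a placeholder for your own measurement.
def episode_reward(nonrenewable_used_at, num_timesteps=100):
    total = sum(nonrenewable_used_at(t) for t in range(num_timesteps))
    return -total  # at most 0, reached only if no non-renewable energy is used
```

The absolute scale does not matter much; what matters is that the reward directly measures the quantity you want to minimise within the current episode.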

You may still have problems with a multi-agent setup oscillating under Q-learning. It will depend on how much of the full, relevant state each agent observes, and on whether you are training several distinct agents at the same time (more likely to oscillate) or a single type of agent with multiple instances in the environment (less likely to oscillate, but it still can if each instance has too blinkered a view of the environment as experienced by the other agents). However, having a single shared goal with a shared reward, as you do here, should in theory help with stability.

Neil Slater
  • I see what you mean. Two questions: **First** - If I reward the agent negatively every time, wouldn't this be a problem? I do understand that the value will be smaller or greater each time. **Second** - Is it okay if I distribute (apply) the same reward across all (S,A) pairs in the episode, or should I be rewarding only the final (S,A) pair at the end of the episode? Thanks. – amitection Jul 09 '18 at 07:50
  • @Neil Slater It is possible that the agent might never get a positive reward, as the negative of the total non-renewable energy consumption can only be zero in an ideal scenario. Additionally, I did not understand the third paragraph of the answer: by "several distinct agents" do you mean agents with different implementations? I have multiple agents running as separate processes with the same implementation. – amitection Jul 09 '18 at 08:02
  • 1
    @amitection: The absolute value of the reward should not matter in this case. There is no special need for positive rewards in general in RL. Although you do have to take care choosing some absolute values, in this case re-framing a positive cost as a negative reward is just fine. By "distinct" I mean different implementation and/or different learning parameters. If the agents are separated and learn in isolation, this leaves you more open to co-evolutions appearing as local minima. It may still be OK though for you, since the only criterion for success is the shared group goal. – Neil Slater Jul 09 '18 at 08:30
  • 1
    @amitection: Your **Second** question in the comment above: It is not OK to re-distribute reward directly like this in general. One trait of TD learning (of which Q learning is an example) is that it will, eventually, figure out long-term consequences from sparse rewards - e.g. a single reward value granted at the end. In your case, declaring the same reward for all actions is probably counter-productive. If possible, the actual reward at any time step should be the negative amount of non-renewable energy used at that time step. If that is not possible, use one large (-ve) reward at the end. – Neil Slater Jul 09 '18 at 08:39
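A per-timestep reward along those lines might look like this (a minimal sketch; `measure_nonrenewable_usage` is a placeholder for however the environment tracks consumption at each step):

```python
# Sketch: dense per-timestep reward instead of one episode-end reward.
# `measure_nonrenewable_usage()` is a placeholder for the environment's own
# measurement of non-renewable energy drawn during the current timestep.
def step_reward(measure_nonrenewable_usage):
    used = measure_nonrenewable_usage()  # energy drawn this step, any consistent unit
    return -used  # 0.0 only when the group used no non-renewable energy this step

# Each transition stored in replay memory then carries its own reward:
#   (s, a, step_reward(...), s_next, done)
# rather than one shared reward copied onto every (S, A) pair of the episode.
```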