I have a multi-agent environment where agents are trying to optimise the overall energy consumption of their group. Agents can exchange energy with one another (the exchange actions are: request, deny request, and grant). This energy is produced from renewable sources and stored in each agent's individual battery. The overall goal is to reduce the energy drawn from non-renewable sources.
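In code terms, the exchange action set looks roughly like this (the enum names are just illustrative labels for the request / deny request / grant actions, not the exact identifiers in my environment):

```python
from enum import Enum

class ExchangeAction(Enum):
    """Energy-exchange actions available to each agent (illustrative names)."""
    REQUEST = 0        # ask another agent for energy
    DENY_REQUEST = 1   # refuse an incoming request
    GRANT = 2          # transfer energy to the requesting agent
```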
All agents have been built using DQN. All (S, A) pairs are stored in a replay memory, from which they are sampled when updating the weights.
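The replay memory is the usual transition buffer that I sample from at update time; a simplified sketch (the real capacity and batch size differ):

```python
import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Stores transitions; a random minibatch is drawn for each weight update."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones
```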
The reward function is modelled as follows: if, at the end of the episode, the aggregate consumption of the agent group from non-renewable sources is less than in the previous episode, all agents are rewarded with +1; if not, they all receive -1. An episode (iteration) consists of 100 timesteps, after which the reward is calculated. I update the weights after each episode.
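Concretely, the end-of-episode reward is computed roughly like this (simplified sketch; the two totals stand for the group's aggregate non-renewable consumption in the current and previous episode):

```python
def episode_reward(current_total: float, previous_total: float) -> float:
    """+1 for every agent if the group's non-renewable consumption dropped
    compared to the previous episode, -1 otherwise."""
    return 1.0 if current_total < previous_total else -1.0
```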
The reward obtained at the end of the episode is used to calculate the error for ALL (S, A) pairs in the episode, i.e. I am assigning every (S, A) pair in that episode the same reward.
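In other words, at the end of an episode every stored transition is labelled with that single reward before the update, roughly like this sketch:

```python
def label_episode(transitions, final_reward):
    """Attach the same end-of-episode reward to every (S, A) pair in the
    episode before it is stored/used for the DQN weight update."""
    return [
        (state, action, final_reward, next_state, done)
        for (state, action, next_state, done) in transitions
    ]
```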
My problem is that the agents are unable to learn behaviour that reduces the overall energy consumption from non-renewable sources. The group's overall consumption keeps oscillating, i.e. sometimes increasing and sometimes decreasing. Does this have to do with the reward function? Or with Q-learning itself, since the environment is dynamic (each agent's environment changes as the other agents learn)?