
I'm working on a personal MARL project with a high-dimensional, continuous action space. The environment gives positive rewards to actions that fall between a pair of moving limits within the action range, and negative rewards to actions outside those limits. For example (a minimal sketch of such an environment follows this list):

  • Global action range: (0, 1000)
  • Desired action range for first 100k steps: (0, 10)
  • Desired action range for 100k-200k steps: (30, 40)
  • ...
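
To make the setup concrete, here is a minimal sketch assuming a Gymnasium-style single-agent interface (the real environment is multi-agent and more complex); the class name, schedule values, and reward magnitudes are illustrative only:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class MovingWindowEnv(gym.Env):
    """Toy version of the task: +1 reward inside a time-dependent window, -1 outside."""

    # (start_step, end_step, window_low, window_high) -- illustrative values only
    SCHEDULE = [(0, 100_000, 0.0, 10.0), (100_000, 200_000, 30.0, 40.0)]

    def __init__(self):
        self.action_space = spaces.Box(low=0.0, high=1000.0, shape=(1,), dtype=np.float32)
        # Normalized step count, so the distributional shift is at least observable.
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)
        self.t = 0

    def _current_window(self):
        for start, end, lo, hi in self.SCHEDULE:
            if start <= self.t < end:
                return lo, hi
        # Past the last entry, keep the final window.
        return self.SCHEDULE[-1][2], self.SCHEDULE[-1][3]

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return np.array([0.0], dtype=np.float32), {}

    def step(self, action):
        lo, hi = self._current_window()
        reward = 1.0 if lo <= float(action[0]) <= hi else -1.0
        self.t += 1
        obs = np.array([min(self.t / 200_000, 1.0)], dtype=np.float32)
        return obs, reward, False, False, {}
```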

Therefore, the main challenge of the environment is that actions that earn positive rewards at one stage of the environment return negative rewards in the following stages. How should I define the actions of the agent? I've tried the following methods without success:

  • Simply scale actions between 0 and 1000 and hope that the agents learn the moving distribution of rewards
  • Transform actions into percent variations and apply them on top of a non-observed moving average (I also tried adding the moving average to the observations, but the results stayed the same; see the sketch after this list)
  • Include a dimension in the observations that serves to signal when a distributional shift happens
  • Model the agents with SAC and DDPG
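
To illustrate the second and third attempts, here is a rough sketch of a wrapper that interprets the policy output as a bounded offset around a moving average of executed actions and appends that average to the observation. The wrapper name, EMA coefficient, and offset scale are illustrative assumptions (not my actual code), and it assumes 1-D Box action and observation spaces, e.g. `env = RelativeActionWrapper(MovingWindowEnv())` with the toy environment above:

```python
import numpy as np
import gymnasium as gym


class RelativeActionWrapper(gym.Wrapper):
    """Policy outputs an offset in [-1, 1] around an EMA of executed actions;
    the EMA is appended to the observation so the reference is not hidden."""

    def __init__(self, env, offset_scale=20.0, ema_alpha=0.01):
        super().__init__(env)
        self.offset_scale = offset_scale
        self.ema_alpha = ema_alpha
        self.reference = 0.0  # exponential moving average of executed raw actions
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        low = np.append(env.observation_space.low, 0.0).astype(np.float32)
        high = np.append(env.observation_space.high, 1000.0).astype(np.float32)
        self.observation_space = gym.spaces.Box(low=low, high=high)

    def _augment(self, obs):
        return np.append(obs, self.reference).astype(np.float32)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.reference = 0.0
        return self._augment(obs), info

    def step(self, action):
        # Interpret the normalized action as an offset around the moving reference.
        raw = float(np.clip(self.reference + float(action[0]) * self.offset_scale,
                            0.0, 1000.0))
        obs, reward, terminated, truncated, info = self.env.step(
            np.array([raw], dtype=np.float32))
        # Let the reference drift towards recently executed actions.
        self.reference = (1 - self.ema_alpha) * self.reference + self.ema_alpha * raw
        return self._augment(obs), reward, terminated, truncated, info
```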

Feel free to share any comments or suggestions.

  • Not 100% clear whether your goal is to create adaptive, continually learning agents and assess them for their ability to cope with rule changes, or to teach the agents to perform optimally on reruns of the entire environment, having learnt to anticipate the changes. – Neil Slater Sep 02 '23 at 20:09

0 Answers