
I am developing an RL agent for a game environment. I have found that there are two strategies for doing well in the game, so I have trained two RL agents using neural networks with distinct reward functions, each reward function corresponding to one of the strategies.

If I use Q-learning, or value-based methods in general, it is easy to combine the results of the two agents to select the action that maximizes the overall value: one can simply add the two agents' Q-values for each action and pick the action with the largest sum.
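For concreteness, here is a minimal sketch of what I mean by adding the values (the Q-values and variable names below are hypothetical):

```python
import numpy as np

# Hypothetical Q-values for one state, one entry per action.
q_agent_a = np.array([0.2, 1.5, -0.3])  # Q(s, a) from agent A
q_agent_b = np.array([0.9, 0.1, 0.8])   # Q(s, a) from agent B

combined_q = q_agent_a + q_agent_b        # elementwise sum of the two value estimates
best_action = int(np.argmax(combined_q))  # greedy action under the combined values
print(best_action)  # -> 1, since 1.5 + 0.1 = 1.6 is the largest combined value
```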

My question is: is it possible to combine the results of policy-based methods, e.g. PPO? The output of a policy-based method is a probability distribution over actions, and I am not sure how to combine two such distributions.
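To illustrate why this is not obvious to me, two plausible ways of combining the distributions can already disagree on which action to take (the probabilities below are made up):

```python
import numpy as np

# Hypothetical action distributions from the two PPO agents for the same state.
pi_a = np.array([0.90, 0.08, 0.02])  # agent A's policy pi_A(a|s)
pi_b = np.array([0.02, 0.49, 0.49])  # agent B's policy pi_B(a|s)

mixture = 0.5 * (pi_a + pi_b)                  # average of the two distributions
product = (pi_a * pi_b) / np.sum(pi_a * pi_b)  # renormalized product of the two

print(np.argmax(mixture))  # -> 0
print(np.argmax(product))  # -> 1
```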
