
I have a custom PPO implementation that works fine, but I need to add the ability to select two actions per turn, one different in nature from the other, and one dependent on the other.

Imagine that a turn has 20 possible actions of type A. For each of these type-A actions, I then need to choose one of 3 possible actions of type B. The two choices are different in kind, so they cannot simply be made on separate turns.

What I tried was flattening the 20×3 space into 60 actions, so that action A is index // 3 and action B is index % 3. But this does not train well. Are there any good methods for this problem?
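For reference, the flattening described above can be sketched like this (a minimal illustration; the names `flatten`/`unflatten` and the sizes are just stand-ins for the setup in the question):

```python
N_A, N_B = 20, 3  # 20 type-A actions, 3 type-B actions per A

def flatten(a, b):
    # Map an (A, B) pair to a single index in [0, N_A * N_B)
    return a * N_B + b

def unflatten(idx):
    # Recover the (A, B) pair from the flat index
    return idx // N_B, idx % N_B

# Round-trip check
assert unflatten(flatten(7, 2)) == (7, 2)
```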

1 Answer


What you describe is possible, but you need to ensure that your policy returns the probability of the full action taken.

In the two step process you describe, you would have something like \begin{align*} \mathbb{P}(a_t) = \mathbb{P}(a_{(t, 1)}=A)\mathbb{P}(a_{(t, 2)}=B \, | \, a_{(t, 1)}=A) \end{align*} where $a_t$ is the action taken, which consists of the two steps $a_{(t,1)}$ and $a_{(t,2)}$.
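In code, that factored probability corresponds to sampling A from one head, sampling B from a conditional head, and summing the log-probabilities for the PPO ratio. Here is a NumPy sketch where the logits are random stand-ins for real policy-network outputs:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits_A = rng.normal(size=20)       # head for action A (stand-in)
logits_B = rng.normal(size=(20, 3))  # head for B, one row per choice of A (stand-in)

p_A = softmax(logits_A)              # P(A), shape (20,)
p_B_given_A = softmax(logits_B)      # P(B | A), each row sums to 1

# Sample A first, then B conditioned on the sampled A
a = rng.choice(20, p=p_A)
b = rng.choice(3, p=p_B_given_A[a])

# Joint log-probability of the two-step action, used in the PPO objective
log_prob = np.log(p_A[a]) + np.log(p_B_given_A[a, b])
```

In a real network, the conditional head would typically take the sampled A (or its embedding) as input rather than producing all 20 rows at once, but the log-probability arithmetic is the same.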

If you aren't strong in probability, here is another way to think about it. You have 60 possible actions (20 possibilities for A, and for each A, 3 possibilities for B). For each of the 60 actions, the policy must return the probability of selecting that (A, B) pair, and those probabilities must sum to 1 over all 60 pairs. If they do not sum to 1, the gradient will not be correct.
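You can check this normalization directly: the 60 joint probabilities P(A)·P(B | A) sum to 1 whenever each factor is itself a proper distribution. Another NumPy sketch with random stand-in logits:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
p_A = softmax(rng.normal(size=20))               # P(A), shape (20,)
p_B_given_A = softmax(rng.normal(size=(20, 3)))  # P(B | A), rows sum to 1

# Outer product gives the full joint table over all 60 (A, B) pairs
joint = p_A[:, None] * p_B_given_A               # shape (20, 3)
print(joint.sum())  # sums to 1, up to floating-point error
```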

Taw