I have a custom PPO implementation that works fine, but I need to extend it so that the agent selects two actions per turn. The two actions are different in nature, and the second depends on the first.
Imagine that a turn has 20 possible actions of type A. For each action of type A, I need to choose one of 3 possible actions of type B. The two are not the same kind of action, and they cannot be split across different turns.
What I tried was to flatten the 20x3 space into a single discrete space of 60 actions. For a flat index i in 0..59, action A is i // 3 and action B is i % 3. But this does not train well. Are there any good methods for this issue?
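
To make the flattening concrete, here is a minimal sketch of the mapping between a flat index and the (A, B) pair. The names `N_A`, `N_B`, `decode` and `encode` are hypothetical and only for illustration. They are not taken from the actual PPO code.

```python
from __future__ import annotations

N_A = 20  # number of type-A actions (from the description above)
N_B = 3   # number of type-B actions available for each A


def decode(flat_index: int) -> tuple[int, int]:
    """Split a flat index in [0, N_A * N_B) into (action_a, action_b)."""
    if not 0 <= flat_index < N_A * N_B:
        raise ValueError(f"flat_index must be in [0, {N_A * N_B}), got {flat_index}")
    # divmod(i, N_B) returns (i // N_B, i % N_B)
    return divmod(flat_index, N_B)


def encode(action_a: int, action_b: int) -> int:
    """Inverse of decode: combine (action_a, action_b) into a flat index."""
    return action_a * N_B + action_b
```

With this layout the policy head has 60 outputs, and each sampled index is decoded into the pair before being sent to the environment.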