The most straightforward solution is to simply make every action "legal", by implementing a consistent, deterministic mapping from potentially illegal actions to legal actions. Whenever the PPO implementation you are using selects an illegal action, you simply replace it with the legal action that it maps to. Your PPO algorithm can then still update itself as if the illegal action were selected (the illegal action simply becomes like... a "nickname" for the legal action instead).
For example, in the situation you describe:
- 2 actions (0 and 1) are always available
- 2 actions (2 and 3) are only available when the internal_state == 0
- 1 action (4) is only available when the internal_state == 1
In cases where internal_state == 0, if action 4 was selected (an illegal action), you can always swap it out for one of the other actions and play that one instead. It doesn't really matter (theoretically) which one you pick, as long as you're consistent about it. The algorithm doesn't have to know that it picked an illegal action; whenever it picks that same illegal action again in similar states in the future, it will consistently get mapped to the same legal action instead, so you just reinforce according to that behaviour.
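A minimal sketch of such a deterministic mapping, assuming the action/state numbering from the list above (the dictionary contents and function name here are hypothetical, purely for illustration):

```python
# Only illegal actions need an entry; legal actions pass through unchanged.
ILLEGAL_TO_LEGAL = {
    0: {4: 0},        # internal_state == 0: action 4 is illegal, always play 0 instead
    1: {2: 0, 3: 1},  # internal_state == 1: actions 2 and 3 are illegal
}

def to_legal_action(action, internal_state):
    """Deterministically replace an illegal action with a fixed legal one."""
    return ILLEGAL_TO_LEGAL[internal_state].get(action, action)

# The agent still updates as if `action` were the action taken;
# only the environment receives the remapped legal action.
env_action = to_legal_action(action=4, internal_state=0)  # -> 0
```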
The solution described above is very straightforward and probably the simplest to implement, but of course it... "smells" a bit "hacky". A cleaner solution would involve a step in the network that sets the probability outputs of illegal actions to $0$ and re-normalizes the rest to sum up to $1$ again. This requires much more care to make sure that your learning updates are still performed correctly, though, and is likely a lot more complex to implement on top of an existing framework like Tensorforce (if not already somehow supported in there out of the box).
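To give an idea of what the masking / re-normalization step looks like in isolation, here is a small numpy sketch (not Tensorforce code; in a real network you would typically add a large negative value to the illegal logits before the softmax instead, so that gradients are handled cleanly):

```python
import numpy as np

def mask_probs(probs, legal_mask):
    """Zero out illegal-action probabilities and renormalize the rest to sum to 1."""
    masked = np.where(legal_mask, probs, 0.0)
    return masked / masked.sum()

# internal_state == 0: actions 0-3 are legal, action 4 is not.
probs = np.array([0.3, 0.2, 0.2, 0.2, 0.1])
legal = np.array([True, True, True, True, False])
print(mask_probs(probs, legal))  # [0.333... 0.222... 0.222... 0.222... 0.]
```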
For the first "solution", I wrote above that it does not matter "theoretically" how you choose you mapping. I absolutely do expect your choices here will have an impact on learning speed in practice though. This is because, in the initial stages of your learning process, you'll likely have close-to-random action selection. If some actions "appear multiple times" in the outputs, they will have a greater probability of being selected with the initial close-tor-andom action selection. So, there will be an impact on your initial behaviour, which has an impact on the experience that you collect, which in turn also has an impact on what you learn.
I certainly expect it will be beneficial for performance if you can include input feature(s) for the internal_state variable.
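One simple way to do that (a sketch, assuming internal_state only takes the values 0 and 1) is to append a one-hot encoding of it to the observation vector before feeding it to the network:

```python
import numpy as np

NUM_INTERNAL_STATES = 2  # assumed: internal_state is either 0 or 1

def augment_observation(obs, internal_state):
    """Append a one-hot encoding of internal_state to the raw observation."""
    one_hot = np.zeros(NUM_INTERNAL_STATES)
    one_hot[internal_state] = 1.0
    return np.concatenate([obs, one_hot])

obs = np.array([0.5, -1.2, 3.0])    # some arbitrary example observation
print(augment_observation(obs, 1))  # [ 0.5 -1.2  3.   0.   1. ]
```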
If some legal actions can be identified that are somehow "semantically close" to certain illegal actions, it could also be beneficial for performance to specifically connect those "similar" actions in the "mapping" from illegal to legal actions, if you choose to go with that solution. For example, if you have a "jump forwards" action that becomes illegal in states where the ceiling is very low (because you'd bump your head), it may be better to map that action to a "move forwards" action (which is still kind of similar, they're both going forwards) than it would be to map it to a "move backwards" action. This idea of "similar" actions will only be applicable to certain domains though; in some domains there may be no such similarities between actions.
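In code, such a similarity-based mapping could look something like this (the action names and indices are purely hypothetical, just to mirror the "jump forwards" example above):

```python
# Hypothetical action indices, purely for illustration.
MOVE_FORWARDS, MOVE_BACKWARDS, JUMP_FORWARDS = 0, 1, 2

# Map each potentially illegal action to its semantically closest legal action.
SIMILARITY_MAP = {JUMP_FORWARDS: MOVE_FORWARDS}

def remap_if_illegal(action, legal_actions):
    return action if action in legal_actions else SIMILARITY_MAP[action]

# Low ceiling: jumping is illegal, so we fall back to moving forwards.
print(remap_if_illegal(JUMP_FORWARDS, legal_actions={MOVE_FORWARDS, MOVE_BACKWARDS}))  # 0
```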