The most straightforward solution is to simply make every action "legal", by implementing a consistent, deterministic mapping from potentially illegal actions to legal actions. Whenever the PPO implementation you are using selects an illegal action, you simply replace it with the legal action that it maps to. Your PPO algorithm can then still update itself as if the illegal action were selected (the illegal action simply becomes like... a "nickname" for the legal action instead).
For example, in the situation you describe:
- 2 actions (0 and 1) are always available
- 2 actions (2 and 3) are only available when the internal_state == 0
- 1 action (4) is only available when the internal_state == 1
In cases where internal_state == 0, if action 4 was selected (an illegal action), you can always swap it out for one of the other actions and play that one instead. It doesn't really matter (theoretically) which one you pick, as long as you're consistent about it. The algorithm doesn't have to know that it picked an illegal action; whenever it picks that same illegal action again in similar states in the future, it will consistently get mapped to the same legal action instead, so you just reinforce according to that behaviour.
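A minimal sketch of such a deterministic mapping, assuming the action/state numbering from the list above (the dictionary contents and function name here are hypothetical, purely for illustration):

```python
# Only illegal actions need an entry; legal actions pass through unchanged.
ILLEGAL_TO_LEGAL = {
    0: {4: 0},        # internal_state == 0: action 4 is illegal, always play 0 instead
    1: {2: 0, 3: 1},  # internal_state == 1: actions 2 and 3 are illegal
}

def to_legal_action(action, internal_state):
    """Deterministically replace an illegal action with a fixed legal one."""
    return ILLEGAL_TO_LEGAL[internal_state].get(action, action)

# The agent still updates as if `action` were the action taken;
# only the environment receives the remapped legal action.
env_action = to_legal_action(action=4, internal_state=0)  # -> 0
```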
The solution described above is very straightforward and probably the simplest to implement, but of course it... "smells" a bit "hacky". A cleaner solution would involve a step in the network that sets the probability outputs of illegal actions to $0$ and re-normalizes the rest to sum up to $1$ again. This requires much more care to make sure that your learning updates are still performed correctly, though, and is likely a lot more complex to implement on top of an existing framework like Tensorforce (if not already somehow supported in there out of the box).
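To give an idea of what the masking / re-normalization step looks like in isolation, here is a small numpy sketch (not Tensorforce code; in a real network you would typically add a large negative value to the illegal logits before the softmax instead, so that gradients are handled cleanly):

```python
import numpy as np

def mask_probs(probs, legal_mask):
    """Zero out illegal-action probabilities and renormalize the rest to sum to 1."""
    masked = np.where(legal_mask, probs, 0.0)
    return masked / masked.sum()

# internal_state == 0: actions 0-3 are legal, action 4 is not.
probs = np.array([0.3, 0.2, 0.2, 0.2, 0.1])
legal = np.array([True, True, True, True, False])
print(mask_probs(probs, legal))  # [0.333... 0.222... 0.222... 0.222... 0.]
```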
For the first "solution", I wrote above that it does not matter "theoretically" how you choose you mapping. I absolutely do expect your choices here will have an impact on learning speed in practice though. This is because, in the initial stages of your learning process, you'll likely have close-to-random action selection. If some actions "appear multiple times" in the outputs, they will have a greater probability of being selected with the initial close-tor-andom action selection. So, there will be an impact on your initial behaviour, which has an impact on the experience that you collect, which in turn also has an impact on what you learn.
I certainly expect it will be beneficial for performance if you can include input feature(s) for the internal_state variable.
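One simple way to do that (a sketch, assuming internal_state only takes the values 0 and 1) is to append a one-hot encoding of it to the observation vector before feeding it to the network:

```python
import numpy as np

NUM_INTERNAL_STATES = 2  # assumed: internal_state is either 0 or 1

def augment_observation(obs, internal_state):
    """Append a one-hot encoding of internal_state to the raw observation."""
    one_hot = np.zeros(NUM_INTERNAL_STATES)
    one_hot[internal_state] = 1.0
    return np.concatenate([obs, one_hot])

obs = np.array([0.5, -1.2, 3.0])    # some arbitrary example observation
print(augment_observation(obs, 1))  # [ 0.5 -1.2  3.   0.   1. ]
```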
If some legal actions can be identified that are somehow "semantically close" to certain illegal actions, it could also be beneficial for performance to specifically connect those "similar" actions in the "mapping" from illegal to legal actions, if you choose to go with that solution. For example, if you have a "jump forwards" action that becomes illegal in states where the ceiling is very low (because you'd bump your head), it may be better to map that action to a "move forwards" action (which is still kind of similar, they're both going forwards) than it would be to map it to a "move backwards" action. This idea of "similar" actions will only be applicable to certain domains though; in some domains there may be no such similarities between actions.
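In code, such a similarity-based mapping could look something like this (the action names and indices are purely hypothetical, just to mirror the "jump forwards" example above):

```python
# Hypothetical action indices, purely for illustration.
MOVE_FORWARDS, MOVE_BACKWARDS, JUMP_FORWARDS = 0, 1, 2

# Map each potentially illegal action to its semantically closest legal action.
SIMILARITY_MAP = {JUMP_FORWARDS: MOVE_FORWARDS}

def remap_if_illegal(action, legal_actions):
    return action if action in legal_actions else SIMILARITY_MAP[action]

# Low ceiling: jumping is illegal, so we fall back to moving forwards.
print(remap_if_illegal(JUMP_FORWARDS, legal_actions={MOVE_FORWARDS, MOVE_BACKWARDS}))  # 0
```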