I have a use case where the set of actions is different for different states. Is the agent aware of what actions are valid for each state, or is the agent only aware of the entire action space (in which case I guess the environment needs to discard invalid actions)?

I presume the agent can be made aware of which actions are valid in each state, but I would like to confirm.

1 Answer

This is actually an implementation choice, and it will depend on how you choose to represent the agent's model of the function that maps from states to actions.

If you explicitly represent the entire state space, as you might choose to do for simple benchmark problems that you solve directly as an MDP with something like value iteration, then you can also explicitly represent the exact set of actions the agent can perform in each state, and the agent can learn the expected value of taking only those actions.
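
For instance, a tabular implementation can store the valid action set alongside each state and maximize only over that set. Here is a minimal sketch (the states, actions, transitions, and rewards below are an invented toy MDP, not anything from the question):

```python
# Value iteration over a toy MDP where each state has its own action set.
# All states, actions, transitions, and rewards here are hypothetical.

# valid_actions[s] lists only the actions the agent may take in state s
valid_actions = {
    "s0": ["left", "right"],
    "s1": ["right"],   # s1 permits a single action
    "s2": [],          # terminal state: no actions at all
}

# transition[(s, a)] = (next_state, reward); deterministic for simplicity
transition = {
    ("s0", "left"):  ("s1", 0.0),
    ("s0", "right"): ("s2", 1.0),
    ("s1", "right"): ("s2", 5.0),
}

gamma = 0.9
V = {s: 0.0 for s in valid_actions}

for _ in range(100):  # a fixed number of sweeps suffices for this toy problem
    for s, actions in valid_actions.items():
        if not actions:          # terminal states keep value 0
            continue
        candidates = []
        for a in actions:        # maximize ONLY over the valid actions
            s_next, r = transition[(s, a)]
            candidates.append(r + gamma * V[s_next])
        V[s] = max(candidates)

print(V)  # {'s0': 4.5, 's1': 5.0, 's2': 0.0}
```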

If your state space is very large, you may not be able to represent it explicitly, and your agent is more likely to use some approximation of the value function or of its policy, as is commonly done in Q-Learning. Here, it is often preferable to define your model of the environment so that taking an invalid action in a state leads to some well-defined outcome, or causes the agent to randomly re-select an action until it happens to pick a valid one. The agent will eventually learn that selecting an invalid action leads to bad outcomes, without ever "realizing" that the action is invalid.
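
As a rough sketch of the first option (the environment, states, and rewards below are all invented for illustration): the environment answers an invalid action with a small penalty and no state change, and a tabular Q-learning agent that only ever sees the full action space learns to avoid those actions:

```python
import random

N_STATES, N_ACTIONS = 3, 2
valid = {0: [0, 1], 1: [1], 2: []}   # hypothetical per-state validity table
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def step(s, a):
    """Invalid actions yield a penalty and leave the state unchanged."""
    if a not in valid[s]:
        return s, -1.0, False        # the well-defined "invalid" outcome
    if a == 0:                       # action 0: stay put
        return s, 0.0, False
    s_next = s + 1                   # action 1: move toward the goal
    return s_next, (10.0 if s_next == 2 else 0.0), s_next == 2

for _ in range(500):
    s = 0
    for _ in range(100):             # cap episode length
        # epsilon-greedy over the FULL action space: the agent is never
        # told which actions are valid, it must learn to avoid them
        if random.random() < epsilon:
            a = random.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda i: Q[s][i])
        s_next, r, done = step(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next
        if done:
            break

# After training, Q[1][0] < Q[1][1], so the greedy policy avoids the
# invalid action in state 1 without ever being told it was invalid.
print(Q)
```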

John Doucette