4

The question is more or less in the title.

A Markov decision process consists of a state space, a set of actions, the transition probabilities and the reward function. If I now take an agent's point of view, does this agent "know" the transition probabilities, or does it only know the state it ended up in and the reward it received when it took an action?

nbro
Felix P.

1 Answer

4

In reinforcement learning (RL), some agents need to know the state transition probabilities and others do not. In addition, some agents may need to be able to sample the result of taking an action, without strictly needing access to the probability matrix. This might be the case if the agent is allowed to backtrack, for instance, or to query some other system that simulates the target environment.

Any agent that needs access to the state transition matrix, or to look-ahead samples of the environment, is called model-based. The model in this case can either be a distribution model, i.e. the state transition matrix, or a sampling model that simulates the outcome of a given state/action combination.

The state transition function $p(r, s'|s, a)$, which gives the probability of observing reward $r$ and next state $s'$ given the current state $s$ and action $a$, is another way to express the distribution model. It often maps simply to the state transition matrix, but it can be a more complete description of the model.
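For concreteness, here is a minimal sketch of the two kinds of model in Python, using a made-up two-state MDP; the state names, actions and probabilities are purely illustrative:

```python
import random

# A distribution model for a hypothetical toy MDP: model[(s, a)] is a list of
# (probability, next_state, reward) outcomes, i.e. a tabular form of p(r, s'|s, a).
DIST_MODEL = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "go"):   [(0.9, "s0", 2.0), (0.1, "s1", 0.0)],
}

def sample_model(s, a):
    """A sampling model: returns a single (next_state, reward) draw instead of
    the full distribution. Here it samples from DIST_MODEL, but it could
    equally be a black-box simulator the agent is allowed to query."""
    outcomes = DIST_MODEL[(s, a)]
    probs = [p for p, _, _ in outcomes]
    _, s_next, r = random.choices(outcomes, weights=probs, k=1)[0]
    return s_next, r
```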

One example of a model-based approach is Value Iteration, which requires access to the full distribution model in order to compute its value update steps. More generally, any reinforcement learning that involves planning must use some kind of model. MCTS, as used in AlphaGo, uses a sampling model, for instance.
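As a sketch, tabular Value Iteration over the toy distribution model above might look like this (the hyperparameters are just illustrative):

```python
def value_iteration(model, states, actions, gamma=0.9, theta=1e-6):
    """Tabular value iteration: it needs the full distribution model to
    compute the expected one-step return of every (state, action) pair."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup using p(r, s'|s, a)
            q_values = [
                sum(p * (r + gamma * V[s_next]) for p, s_next, r in model[(s, a)])
                for a in actions
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

# e.g. value_iteration(DIST_MODEL, ["s0", "s1"], ["stay", "go"])
```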

Many RL approaches are model-free: they do not require access to a model. They work by sampling from the environment and, over time, learning how the unknown state transition function affects expected returns. Example methods that do this are Monte Carlo control, SARSA, Q-learning and REINFORCE.
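By contrast, a model-free method such as tabular Q-learning only ever touches sampled transitions. A minimal sketch of a single update (again with made-up hyperparameters):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One model-free Q-learning step: no transition probabilities appear
    anywhere, only the sampled transition (s, a, r, s')."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Q = defaultdict(float)
# after each real environment step: q_learning_update(Q, s, a, r, s_next, actions)
```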

It is possible to combine model-free and model-based methods by using observations to build an approximate model of the environment and then using that model in some form of planning. Dyna-Q is an approach that does this by simply remembering past transitions and re-using them in the background to refine its value estimates. Arguably, the experience replay table in DQN is a similar form of background planning (the algorithm is essentially the same). However, more sophisticated model learning and reuse is generally not as successful and is not commonly seen in practice. See How can we estimate the transition model and reward function?
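A rough sketch of the Dyna-Q idea, reusing the q_learning_update helper from the previous sketch; here the learned model just stores the last observed outcome of each (state, action) pair, and all names and the planning budget are illustrative:

```python
import random

def dyna_q_step(Q, learned_model, s, a, r, s_next, actions,
                n_planning=10, alpha=0.1, gamma=0.9):
    """One Dyna-Q step: learn from the real transition, record it in a simple
    learned model, then replay remembered transitions as extra 'imagined' updates."""
    # Direct RL update from the real experience
    q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma)
    # Model learning: remember the last observed outcome of (s, a)
    learned_model[(s, a)] = (r, s_next)
    # Planning: reuse remembered transitions in the background
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(learned_model.items()))
        q_learning_update(Q, ps, pa, pr, ps_next, actions, alpha, gamma)
```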

In general, model-based methods can learn faster than model-free methods on the same environment, since they start with more information that they do not need to learn. However, it is quite common to need to learn without an accurate model available, so there is a lot of interest in model-free learning. Sometimes an accurate model is possible in theory, but it would be more work to calculate predictions from it than to estimate values statistically from observations.

Neil Slater