I'm studying reinforcement learning. It seems that "state" and "observation" mean exactly the same thing. They both capture the current state of the game.

Is there a difference between the two terms? Is the observation maybe the state after the action has been taken?

echo
  • In the paper [_Imagination-Augmented Agents for Deep Reinforcement Learning_](https://arxiv.org/abs/1707.06203), the authors use the term "observation" instead of "state" on purpose, so you may want to have a look at it to understand why. Also have a look at [the Wikipedia article on POMDPs](https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process), a mathematical framework that generalizes MDPs and in which observations and states are clearly differentiated. – nbro Nov 16 '18 at 23:11

2 Answers


Sometimes observation and state overlap completely, which is convenient. However, there is no reason to expect this in all cases, and that is where the interesting problems occur.

Reinforcement learning theory is based on Markov Decision Processes (MDPs), which leads to a formal definition of state. Most importantly, the state must have the Markov property: knowing the state means knowing everything knowable that could determine the environment's response to a specific action. Everything that remains must be purely stochastic and unknowable in principle until after the action is resolved.
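
In symbols, the Markov property says that the distribution of the next state and reward depends only on the current state and action, not on the rest of the history:

$$\Pr(S_{t+1}=s', R_{t+1}=r \mid S_t, A_t) \;=\; \Pr(S_{t+1}=s', R_{t+1}=r \mid S_0, A_0, R_1, S_1, A_1, \ldots, R_t, S_t, A_t)$$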

Systems such as deterministic or probability-driven games and computer-controlled simulations can be designed to have easily observable states with this property. Games with this trait are often called "games of perfect information", although there may still be unknown information, provided it is revealed in a purely stochastic manner.

In practice, real-world interactions contain far too much detail for any observation to be a true state with the Markov property. For instance, consider the inverted pendulum environment, a classic RL toy problem. A real inverted pendulum would behave differently depending on its temperature, which could vary along its length; the joint and actuators might be sticky; rotations and movement will alter temperature and friction, and so on. However, an RL agent will typically only consider the current position and velocity of the cart and the angle and angular velocity of the pendulum. In this case, an observation of those four quantities is usually good enough, and a state based on it almost has the Markov property.
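
As a concrete illustration, here is a minimal sketch using Gymnasium's `CartPole-v1` environment (one common implementation of this toy problem): the observation it returns is exactly that 4-vector of cart position, cart velocity, pole angle and pole angular velocity, and in practice it is treated as the state.

```python
import gymnasium as gym

# CartPole-v1 observations are a 4-vector:
# [cart position, cart velocity, pole angle, pole angular velocity].
# The real physics has far more detail (friction, temperature, ...),
# but this observation is close enough to a Markov state in practice.
env = gym.make("CartPole-v1")

obs, info = env.reset(seed=0)
print(env.observation_space)  # a Box space with shape (4,)
print(obs)                    # e.g. a float array of 4 values

# Step with a random action; the next observation is, again, everything
# the agent gets to "see" about the environment.
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```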

There are also problems where observations alone do not make usable state data for an RL system. The DeepMind Atari DQN paper had examples of a couple of these. The first is that a single frame loses information about motion. This was addressed by taking four consecutive frames and combining them into a single state. It could be argued that each frame is an observation, and that four observations had to be combined in order to construct a more useful state (although this could be put aside as just semantics).
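
A rough sketch of that frame-stacking idea (the class and names below are illustrative, not the DQN paper's actual code): keep a buffer of the last four frame observations and stack them into the single array that is used as the state.

```python
from collections import deque
import numpy as np

class FrameStacker:
    """Combine the last `k` single-frame observations into one stacked state."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Fill the buffer with copies of the first frame so the
        # stacked state always contains k frames.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def append(self, frame):
        self.frames.append(frame)
        return self.state()

    def state(self):
        # Stack along a new leading axis: shape (k, H, W).
        return np.stack(self.frames, axis=0)

# Usage with hypothetical 84x84 grayscale frames:
stacker = FrameStacker(k=4)
state = stacker.reset(np.zeros((84, 84), dtype=np.uint8))
state = stacker.append(np.ones((84, 84), dtype=np.uint8))
print(state.shape)  # (4, 84, 84)
```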

The second example from Atari DQN is that the pixel observations did not include data that the game was tracking but that was not visible on screen. Games with large scrolling maps are a weakness of the Atari-playing DQN, because its state has no memory of anything beyond the four recent frames used to capture motion. An example of such a game, where DeepMind's agent did much worse than a human player, is Montezuma's Revenge, where progressing requires remembering some off-screen locations.

There are ways to address the fact that there is unobserved but relevant state in a problem. The general framework for describing such problems is the Partially Observable Markov Decision Process (POMDP). Workable solutions include adding explicit memory or a "belief state" to the state representation, or using a system such as an RNN to internalise the learning of a state representation driven by a sequence of observations.
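
For example, here is a minimal sketch of the RNN approach, assuming PyTorch (the class and names are hypothetical): a GRU cell carries a hidden vector across time steps, so the input to the policy head summarises the whole observation history rather than just the latest observation.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Policy that builds an internal state from a sequence of observations."""
    def __init__(self, obs_dim, hidden_dim, n_actions):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, hidden):
        # `hidden` acts as a learned "belief state" summarising past observations.
        hidden = self.rnn(obs, hidden)
        logits = self.head(hidden)
        return logits, hidden

policy = RecurrentPolicy(obs_dim=4, hidden_dim=32, n_actions=2)
hidden = torch.zeros(1, 32)           # initial belief state (batch of 1)
obs = torch.zeros(1, 4)               # one observation
logits, hidden = policy(obs, hidden)  # hidden now encodes the history so far
```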

Neil Slater
  • Thanks for the response! So an observation is a subset of the state? – echo Apr 09 '18 at 21:58
  • @echo: Not necessarily. You could also observe things that are irrelevant to the state. So it is more accurate to say that state and observation may overlap. Generally, when they overlap a lot, RL is easier to apply, so you try and make this happen - most toy problems in RL are designed so that observation and state overlap perfectly, for convenience. But sometimes interesting problems require you to deal with missing/unobserved state data. – Neil Slater Apr 09 '18 at 22:03

There is a subtle but important difference between observation and state.

The observation is the information that the agent gathers from the environment, for example data coming from sensors. It could be noisy or contain redundant information. It could also be incomplete, not containing enough to capture all the information needed to build the state. (In this case, the agent might be able to combine what it already knows about the current state with the new observation data to build the next state.)

The state is the information that describes all the relevant aspects of the environment that the agent's policy needs in order to make a decision. We also like to distinguish between the environment state, which could be huge and impossible to capture fully (especially in scenarios where the agent is interacting with the physical world), and the agent state, which is the distilled version that only captures the important information the agent needs to make a decision.
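
A toy sketch of that distinction (everything below is hypothetical, purely for illustration): the environment tracks more than it ever shows the agent, and the agent maintains its own distilled state built from its observations.

```python
import random

class WindyCorridor:
    """Toy environment whose full state is never fully observable."""
    def __init__(self):
        # Full environment state: position AND a hidden wind direction.
        self.position = 0
        self.wind = random.choice([-1, +1])  # hidden from the agent

    def step(self, action):          # action is -1 or +1
        self.position += action + self.wind
        observation = self.position  # the agent only ever sees its position
        return observation

class Agent:
    """Keeps a distilled agent state estimated from observations."""
    def __init__(self):
        self.last_position = 0
        self.wind_estimate = 0.0     # belief about the hidden wind

    def update_state(self, action, observation):
        # Infer the wind from how far we actually moved versus intended.
        drift = (observation - self.last_position) - action
        self.wind_estimate = 0.9 * self.wind_estimate + 0.1 * drift
        self.last_position = observation

env, agent = WindyCorridor(), Agent()
for _ in range(20):
    action = random.choice([-1, +1])
    obs = env.step(action)
    agent.update_state(action, obs)
print(env.wind, round(agent.wind_estimate, 2))  # hidden truth vs agent's belief
```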

In simple settings, the observation could be equivalent to the state. In a fully-observable board game (e.g. checkers) the position of the pieces is the observation and also the state. In more complex scenarios this is not the case.

When using deep RL or most of the modern RL variants, the state is often not explicitly encoded. It is a feature of these modern algorithms to use function approximation to bridge the observation inputs directly into the policy or value function. This is especially useful if the number of unique states is huge, and you also want your policy to be robust enough to behave correctly even for unseen states that are similar (but not identical) to the ones it was trained on.
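
As a minimal sketch of what that looks like in practice (assuming PyTorch, with illustrative dimensions): a small network maps the raw observation vector directly to one value per action, with no explicit, enumerated state representation in between; because the mapping is a smooth function, similar but unseen observations receive similar values.

```python
import torch
import torch.nn as nn

# A value-function approximator that consumes observations directly;
# no explicit, enumerated state representation is built in between.
q_net = nn.Sequential(
    nn.Linear(4, 64),   # 4 = observation dimension (e.g. CartPole)
    nn.ReLU(),
    nn.Linear(64, 2),   # 2 = number of actions
)

obs = torch.tensor([[0.02, -0.01, 0.03, -0.02]])  # one observation vector
q_values = q_net(obs)                             # one value per action
action = int(q_values.argmax(dim=1))

# A nearby observation the network was never trained on passes through
# the same smooth function, so it gets similar values -- this is where
# generalisation to unseen states comes from.
similar_q_values = q_net(obs + 1e-3)
```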

This is why, in frameworks such as Gymnasium, you see observations rather than states, which is probably one of the sources of the confusion between the two terms.

jbx