
Consider the Breakout environment.

We know that the underlying world behaves like an MDP, because the evolution of the system depends only on the current state (i.e. the position, speed, and direction of the ball, the positions of the bricks, the position of the paddle, etc.). But if we consider only single frames as the state space, we have a POMDP, because we lack information about the dynamics [1], [2].

What could happen if we wrongly assume that the POMDP is an MDP and apply reinforcement learning under this assumption?

Obviously, the question is more general, not limited to Breakout and Atari games.


1 Answer


What could happen if we wrongly assume that the POMDP is an MDP and apply reinforcement learning under this assumption?

It depends on a few things. The theoretical basis of reinforcement learning requires the state description to have the Markov property in order to guarantee convergence to optimal or approximately optimal solutions. The Markov property requires that the state captures 100% of the controllable variation in the reward and next state (given the action); whatever remains must be purely stochastic.
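In symbols (a standard textbook formulation, not taken from the sources cited in the question), the state $S_t$ is Markov when conditioning on the full history adds nothing beyond the current state and action:

$$P(S_{t+1}=s', R_{t+1}=r \mid S_t, A_t) = P(S_{t+1}=s', R_{t+1}=r \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0)$$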

A state representation can be "nearly Markov", and a lot of real-world physical systems are like that. For instance, pole-balancing and acrobot tasks can be implemented as physical systems using motors, wheels, joints, etc. In those real systems, there are limits to the accuracy with which the state can be measured, and there are many hidden variables, such as variable temperature (affecting the length of components), friction effects, and air turbulence. Taken strictly by the formal definition, those hidden variables would make the system a POMDP. However, their influence compared to the key state variables is low, and in some cases effectively random from the perspective of the agent. In practice, RL works well in these real physical systems, despite the state data being technically incomplete.

In Atari games using multiple frame images as states, those states are already non-Markovian to varying degrees. In general, a computer game's state may include many features that are not displayed on the screen. Enemies may have health totals or other hidden state, there may be timers controlling the appearance of hazards, and in a large number of games the screen shows only a relatively small window into the total play area. However, DeepMind's DQN did well on a variety of scrolling combat and platform games.
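As an illustration of how DQN reduces the partial observability of single frames, here is a minimal sketch (not DeepMind's actual code) of the frame-stacking idea: the observation handed to the agent is the last k raw frames, so velocity information is implicitly present. The `env` object is assumed to expose Gym-style `reset()`/`step()` methods returning image frames.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Sketch of frame stacking: observations are the last k frames stacked."""

    def __init__(self, env, k=4):
        self.env = env          # assumed: Gym-style env returning image frames
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self):
        frame = self.env.reset()
        # Fill the buffer with copies of the first frame.
        for _ in range(self.k):
            self.frames.append(frame)
        return np.stack(list(self.frames), axis=0)          # shape: (k, H, W)

    def step(self, action):
        frame, reward, done, info = self.env.step(action)
        self.frames.append(frame)
        return np.stack(list(self.frames), axis=0), reward, done, info
```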

One game where DQN did notably badly - no better than a random baseline player - was Montezuma's Revenge. Not only does that platform puzzler have a large map to traverse, but it also includes components where the state on one screen affects outcomes on another.

It is hard to make a general statement about when an environment with missing useful state information would benefit from being treated more formally as a POMDP. Your question is essentially the same thing expressed in reverse.

The honest answer for any non-trivial environment is to try an experiment. It is also possible to make some educated guesses, based on the question: "If the agent could know hidden feature x from the state, how different would the expected reward and policy be?"
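A rough sketch of such an experiment, in which every name (`make_env`, `make_agent`, `evaluate`) is a hypothetical placeholder rather than a specific library API: train the same algorithm once on single-frame states and once on 4-frame stacks, and compare the average scores.

```python
def compare_state_representations(make_env, make_agent, evaluate, n_seeds=3):
    """Hypothetical experiment: same algorithm, two state representations.

    make_env(stack_frames) -> environment yielding 1-frame or k-frame states
    make_agent(env)        -> an RL agent (e.g. DQN) for that environment
    evaluate(agent, env)   -> average episode return after training
    """
    results = {}
    for k in (1, 4):
        scores = []
        for _ in range(n_seeds):
            env = make_env(stack_frames=k)
            agent = make_agent(env)
            agent.train()                      # placeholder training call
            scores.append(evaluate(agent, env))
        results[k] = sum(scores) / len(scores)
    return results        # {1: single-frame score, 4: stacked-frame score}
```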

For the Breakout example using each single frame as the state representation, I would expect the following to hold:

  • Value estimates become much harder, since seeing the ball next to a brick gives much less confidence that it is about to hit that brick and score some points than seeing the ball progressively approach the brick over 4 frames.

  • It should still be possible for the agent to optimise play, as one working strategy is to position the "bat" under the ball at all times. This means less precise control over the angle of bounces, so I would expect it to perform worse than the four-frame version, but still significantly better than a random-action agent. A key driver for this observation is that seeing the ball close to the bottom of the screen, and not close to the bat, would still be a good predictor of low expected future reward (even averaged over the chances of the ball going up vs going down), hence the controller should act to prevent such states from occurring. A minimal sketch of such a "follow the ball" policy is given after this list.
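Here is that sketch: a hand-coded, single-frame "follow the ball" policy for Breakout. It only illustrates that no velocity information is needed for basic positioning; the pixel-processing details (`ball_colour`, `paddle_row`) are hypothetical preprocessing constants, and the action indices assume the usual 4-action Breakout action set.

```python
import numpy as np

NOOP, FIRE, RIGHT, LEFT = 0, 1, 2, 3   # assumed 4-action Breakout action set

def follow_the_ball(frame, ball_colour, paddle_row):
    """Crude single-frame policy: keep the paddle under the ball.

    `frame` is assumed to be a 2-D greyscale image; `ball_colour` and
    `paddle_row` are hypothetical preprocessing constants that depend on
    how the frames were extracted.
    """
    # Locate the ball by its colour (hypothetical detection scheme).
    _, ball_cols = np.where(frame == ball_colour)
    if len(ball_cols) == 0:
        return FIRE                     # no ball on screen: launch a new one
    ball_x = ball_cols.mean()

    # Locate the paddle as the lit pixels on its (known) row.
    paddle_cols = np.where(frame[paddle_row] > 0)[0]
    if len(paddle_cols) == 0:
        return NOOP
    paddle_x = paddle_cols.mean()

    # Move towards the ball's column; no velocity information is used.
    if ball_x > paddle_x + 1:
        return RIGHT
    if ball_x < paddle_x - 1:
        return LEFT
    return NOOP
```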

  • Thank you. Theoretically speaking, the RL algorithms that assume the Markov property are not guaranteed to succeed (e.g. SARSA does not converge). However, as in Breakout (and in many other cases), the reduced feature space (i.e. without the ball's velocity) should still allow estimating state values accurately enough to play the game well. But an algorithm that assumes the Markov property will never estimate the state values exactly, because it is trying to solve a problem that cannot be solved with its capabilities: Breakout without considering the ball's velocity is not an MDP. Right? – Marco Favorito Jun 20 '18 at 13:14
  • Right. It's not strictly an MDP over the available state when velocities are hidden. But then neither are side-scrollers without considering the whole map, even when velocities are available. Yet DQN can still attempt them and often does quite well. SARSA and Q-Learning are guaranteed to converge in tabular form (given a few other assumptions/caveats). SARSA with function approximation has a similar guarantee, with some loose bounds on the maximum error at "convergence" (which is not a stable single value, but should stay within the error bounds). – Neil Slater Jun 20 '18 at 13:20