6

I have trouble understanding the meaning of partially observable environments. Here's my doubt.

According to what I understand, the state of the environment is what precisely determines the next state and reward for any particular action taken. So, in a partially observable environment, you don't get to see the full environment state.

So, now, consider the game of chess. Here, we are the agent and our opponent is the environment. Even here, we don't know what move the opponent is going to make. So, we don't know the next state and reward we are going to get. Also, what we can see doesn't precisely determine what is going to happen next. Then why do we call chess a fully observable game?

I feel I am wrong about the definition of an environment state or the definitions of fully observable and partially observable. Kindly correct me.

  • Could you reference where you have seen chess described as "fully observable"? I usually see the term "a game of perfect information", which is a different term. I think I have an answer for you, but want to check I am using the same terminology as you are referencing, because your question is about the use of that terminology in two slightly different fields of study. – Neil Slater May 22 '21 at 14:45
  • Note that it's not the "state" that determines the next state, but the dynamics of the environment, which are encoded in the probability distribution $p(s', r \mid s, a)$ in an MDP. – nbro May 22 '21 at 23:02
  • @NeilSlater Figure 2.6, p. 45 of AIMA (3rd edition) describes chess as fully observable. In chess, you have access to the state (the configuration of the board), so I guess it's considered fully observable. – nbro May 22 '21 at 23:09
  • @NeilSlater I've seen it [here](https://en.wikipedia.org/wiki/Partially_observable_system) – CHANDRASEKHAR HETHA HAVYA May 23 '21 at 03:08
  • @nbro: I don't have access to that book. Is that chapter specifically about RL? Game theory, analysis of optimal play and search algorithms could use the term "fully observable" reasonably, but without the same thing applying to RL training to play versus an opponent whose decision-making is unknown. – Neil Slater May 23 '21 at 07:43

3 Answers

4

You are correct in the question that, in RL terms, a game of chess where the agent is one player and the other player's strategy is unknown is a partially observable environment. Chess played like this is not a fully observable environment.

I did not use the term "fully observable game" or "fully observable system" above, because that is not a reinforcement learning term. You may also read "game of perfect information", which is similar - it means there are no important hidden values in the state of the game which may impact optimal play. This is a different concern from understanding the state of your opponent.

Here is a counter-example showing that games of perfect information are not fully observable systems when you have an opponent with an unknown strategy:

  • Optimal play in tic tac toe leads to a forced draw.

  • Let's define a reward signal from the agent's perspective of +1 for a win, 0 for a draw, and -1 for a loss.

  • If the agent's opponent always plays optimally, then an RL agent will learn to counter that optimal play and also play optimally. All action choices will have an expected return of 0 or -1, and the agent will choose the 0 options when acting greedily.

  • If the agent's opponent can make a mistake that allows the agent to win, then there will be a trajectory through the game with a return of 1, or perhaps some other positive value in cases where the mistake is only made according to random chance.

  • The value of states in the game therefore depends on the opponent's strategy (see the sketch just after this list).

  • The opponent's strategy is however not observable - it is unknown and not encoded into the board state.
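
To make this concrete, here is a minimal numerical sketch in Python. The two candidate moves and the blunder probability `epsilon` are hypothetical, not taken from any real position; the point is only that the greedy action at a fixed board state flips depending on a quantity that lives in the opponent's policy rather than on the board.

```python
# A minimal sketch with hypothetical numbers: the value of the *same* board
# state depends on the opponent's (unobservable) blunder probability epsilon.
#
# Action "safe_line": a line that leads to a forced draw, return 0.
# Action "trappy_line": a line that wins (+1) if the opponent blunders,
# and loses (-1) if the opponent answers correctly.

def action_values(epsilon: float) -> dict:
    """Expected returns of the two hypothetical actions for a given
    opponent blunder probability."""
    return {
        "safe_line": 0.0,  # draw regardless of the opponent
        "trappy_line": epsilon * 1.0 + (1.0 - epsilon) * -1.0,
    }

for eps in (0.0, 0.3, 0.6):
    q = action_values(eps)
    greedy = max(q, key=q.get)
    print(f"blunder probability {eps:.1f}: {q} -> greedy action: {greedy}")

# With epsilon = 0.0 the safe line is greedy; with epsilon = 0.6 the trappy
# line is. Since epsilon is a property of the opponent's policy and is not
# encoded in the board state, the board state alone is not a Markov state
# for this agent, which is exactly the partial observability described above.
```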

This should match your intuition when asking the question.

Why, then, do many two-player game-playing reinforcement learning agents for games like chess perform well without using POMDPs?

This is because game theory on these environments supports the concept of "perfect play", and agents that assume their opponent will also attempt to play optimally - without mistakes - will usually do well. Game theory analyses these choices, leading to forms of minimax play - making the choice that your opponent is least able to exploit.
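
As a rough illustration of that assumption, here is a minimal minimax sketch in Python. The `Game` interface below is hypothetical, not a real library; the key point is that the opponent is simply assumed to pick whichever move is worst for the agent, so no model of the actual opponent's policy is needed.

```python
# A minimal sketch of the "perfect play" assumption: plain minimax over an
# abstract game tree. The Game interface is hypothetical, not a real library.
from typing import Hashable, Iterable, Protocol

class Game(Protocol):
    def legal_moves(self, state: Hashable) -> Iterable[Hashable]: ...
    def next_state(self, state: Hashable, move: Hashable) -> Hashable: ...
    def is_terminal(self, state: Hashable) -> bool: ...
    def utility(self, state: Hashable) -> float: ...  # from the agent's view

def minimax(game: Game, state: Hashable, agent_to_move: bool) -> float:
    """Value of `state` assuming both players play optimally."""
    if game.is_terminal(state):
        return game.utility(state)
    values = [
        minimax(game, game.next_state(state, m), not agent_to_move)
        for m in game.legal_moves(state)
    ]
    # The agent maximises its return; the opponent is assumed to minimise it.
    return max(values) if agent_to_move else min(values)
```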

That does mean that such an agent may in fact play sub-optimally against any given opponent. For example, it could potentially turn some losing or drawn situations into a win, but it will have little or no capability to do so unless trained against that kind of opponent. Also, playing like this may be a large risk against other opponents, because it may involve playing sub-optimally at some earlier stage.

I have observed a related issue in Kaggle's Connect X competition. Connect 4 is a solved game where player one can force a win, and the best agents are all perfect players. However, they are not all equal. The best performers tweak their agent's choices for player two, to force the highest number of wins against other agents who have not coded a perfect player one. Different kinds of learning strategy lead to different imperfections, and the top of the leaderboard is occupied by the current best perfect agent that also manages to exploit the population of near-perfect agents below it in the rankings. This difference in the top-ranking agents is only possible due to the partially-observable nature of the Connect 4 game played against agents with unknown policies.

What exactly are partially observable environments?

They are environments where, in at least some states, the agent does not have access to information that affects the distribution of the next state or reward.

Chess played against an opponent for whom you have a model of their behaviour - i.e. their policy - is fully observable to the agent. This is implicitly assumed by self-play agents and systems, and can work well in practice.

Chess played against an opponent without a model of their behaviour is partially observable. In theory, you could attempt to build a system using a partially observable MDP model (POMDP) to account for this, trying to force the opponent into states where they are more likely to make a decision that is good for the agent. However, simply playing as optimally as possible in response to all plays by the opponent - i.e. assuming their policy is the same near-optimal one as yours, even after observing their mistakes - is more usual in RL.
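
As a hedged sketch of what that POMDP-style alternative could look like, the snippet below (with made-up probabilities and only two hypothetical "opponent types") treats the opponent's unknown policy as hidden state and updates a belief over it with Bayes' rule after each observed move; the agent could then pick moves that are good in expectation under that belief.

```python
# A minimal sketch with made-up numbers: treat the opponent's unknown policy
# as hidden state and maintain a belief over two hypothetical opponent types.

# P(move | opponent type) for one particular position (invented values).
move_likelihood = {
    "optimal":  {"best_reply": 0.95, "blunder": 0.05},
    "beginner": {"best_reply": 0.40, "blunder": 0.60},
}

def update_belief(belief: dict, observed_move: str) -> dict:
    """One Bayes-rule step: P(type | move) is proportional to
    P(move | type) * P(type)."""
    unnormalised = {
        t: move_likelihood[t][observed_move] * p for t, p in belief.items()
    }
    z = sum(unnormalised.values())
    return {t: v / z for t, v in unnormalised.items()}

belief = {"optimal": 0.5, "beginner": 0.5}  # prior over opponent types
belief = update_belief(belief, "blunder")
print(belief)  # the belief shifts sharply towards "beginner"
```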

The original AlphaGo actually had separate policy networks for its own choices and for modelling those of humans. This was selected experimentally as performing slightly better than assuming human opponents used the same policy as the self-play agent.

Neil Slater
  • I'd argue that "the game of chess", with no additional qualifiers, is a fully observable, multi-agent environment. But if you want to apply single-agent RL algorithms, you have to pretend that it's a single-agent environment which you can do by "baking the opponent into the environment". I'd name that environment differently though, like "the game of chess with X as opponent", and then that might be a partially observable environment (or one that's still fully observable, but stochastic, if you know the opponent's policy!). I like both your and nbro's answer, depending on perspective. – Dennis Soemers May 23 '21 at 12:50
  • @DennisSoemers I think I have been careful to add that condition (on knowledge about the opponent) everywhere I make a statement about chess being or not being partially observable. If I have missed one, or it is not clear anywhere please identify for me and I will try to fix it. – Neil Slater May 23 '21 at 12:53
  • Urmm, no, I don't think you necessarily missed it anywhere. Just thought it was a useful point to clarify / emphasise in a comment :D – Dennis Soemers May 23 '21 at 13:11
  • Why do you need to know the opponent's strategy to make it fully observable? It would be like knowing the dynamics of the environment, which do not need to be encoded in the state to take the optimal action. If the optimal play leads to a draw (like in tic-tac-toe), that's orthogonal to what we're talking about, as well as the def. of the reward. To make it fully observable, you just need to know the state _you_ are in (that's the definition of fully observable). Of course, if you know the opponent's strategy, you can exploit it to lead to a win (rather than a draw), but that's different. – nbro May 24 '21 at 10:37
  • It seems that you just shifted the definition of the state from the configuration of the board to also know the strategy of the opponent (i.e. the dynamics of the environment). In that case, yes, chess would generally be partially observable, but I don't understand why the opponent's strategy needs to be _necessarily_ included in the state. To me, it seems that you're just mixing concepts here, but maybe I am wrong (I'm not an expert in this topic). – nbro May 24 '21 at 10:37
  • I have not shifted anything. It's the correct definition for an RL environment, whilst optimal play in a game theory sense would not need it. I think it is the OP who has mixed concepts, and this answer tries to correct that. – Neil Slater May 24 '21 at 11:24
  • @nbro: Another way to put it. **RL**: I don't know what the probabilities are of the opponent taking each kind of move, which means my MDP model is incomplete, and I need to do something about it. **Game Theory**: I don't care what the opponent does, I am going to choose a move where they will get the least benefit no matter what they decide. **RL** [addressed to Game Theory] - I'll use your assumptions then, that's going to be good most of the time. – Neil Slater May 24 '21 at 11:45
  • The MDP is not "incomplete", it's "unknown". That's what most RL algorithms assume anyway. The MDP could also be stochastic, i.e. the strategy of the opponent could change over time. Can you please cite anyone (e.g. the Sutton and Barto's book or any other paper) that claims that chess is partially observable because we don't know the strategy of the opponent? I would like to read more about it. – nbro May 24 '21 at 11:48
  • @nbro: Agreed it is unknown if you train against the target opponent. It would be incomplete in simulation. So I have made an unstated assumption that we are talking about training a chess-playing RL in simulation or self play. If you have the actual human to train against, then conceptually the MDP is complete and learnable. But that is not what any RL self-play board game playing systems do – Neil Slater May 24 '21 at 11:50
  • Sutton & Barto chapter 1 covers this, discussing tic tac toe. More accurately it is one of the working problems, but I do have the official answers to chapter 1 (from Sutton's office) and it matches my analysis here. I will see if I can find something linkable... – Neil Slater May 24 '21 at 11:53
  • Best I have found so far is section 11.3 of https://www.cp.eng.chula.ac.th/~vishnu/gameResearch/AI/Ghory04%20RL%20in%20board%20game.pdf which indirectly references the issue, but does not directly use the term "partially observable" to describe it. I have to stop for now. – Neil Slater May 24 '21 at 12:01
  • I've read this discussion and comments again, and I realize that I said "The MDP could also be **stochastic**, i.e. the strategy of the opponent could change over time", while I clearly meant **non-stationary**. So, if the opponent's strategy is fixed, you can model the problem as an MDP and the fixed strategy of the opponent can be encoded into the dynamics of the MDP. However, if the strategy changes, then I think you could model the problem as a non-stationary MDP (but I am not sure how useful this might be), so I guess you wouldn't need to model it as POMDP, even for the self-play case. – nbro Jan 07 '22 at 18:12
  • I understand your statements about the game-theoretic view that assumes that the opponent will play optimally, so we will never lose but we may also not win even if the opponent makes a mistake. So, as Dennis also pointed out, these games with other agents, like chess, can be modeled as a multi-agent system, or we can view the opponents as part of the environment too. – nbro Jan 07 '22 at 18:12
3

First, note that the current state alone does not determine the next state. What determines the next state are the dynamics of the environment, which, in the context of reinforcement learning and, in particular, MDPs, are encoded in the probability distribution $p(s', r \mid s, a)$. So, if the agent is in a certain state $s$, it could end up in another state $s'$, but this is not determined just by being in $s$: it also depends on $a$ (the action that you take in $s$) and $p$ (the dynamics of the environment).
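
To illustrate this with a made-up toy MDP (not any standard environment), here is a minimal sketch where the next state and reward are sampled from $p(s', r \mid s, a)$, so they depend on the current state, the chosen action and the environment's transition probabilities, not on the state alone.

```python
# A minimal sketch of a made-up toy MDP: the next state and reward are sampled
# from p(s', r | s, a), so they depend on the current state s, the action a
# and the environment's transition probabilities, not on s alone.
import random

# p[s][a] is a list of (probability, next_state, reward) triples.
p = {
    "s0": {
        "left":  [(1.0, "s1", 0.0)],
        "right": [(0.8, "s2", 1.0), (0.2, "s1", -1.0)],
    },
    "s1": {"left": [(1.0, "s1", 0.0)], "right": [(1.0, "s0", 0.0)]},
    "s2": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s2", 0.0)]},
}

def step(s: str, a: str) -> tuple:
    """Sample (s', r) from p(s', r | s, a)."""
    probs = [pr for pr, _, _ in p[s][a]]
    outcomes = [(s2, r) for _, s2, r in p[s][a]]
    return random.choices(outcomes, weights=probs, k=1)[0]

print(step("s0", "right"))  # ('s2', 1.0) with probability 0.8, ('s1', -1.0) otherwise
```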

Now, in the 3rd edition of the AIMA book, Russell and Norvig define fully observable environments as follows.

Fully observable vs. partially observable: If an agent's sensors give it access to the complete state of the environment at each point in time, then we say that the task environment is fully observable. A task environment is effectively fully observable if the sensors detect all aspects that are relevant to the choice of action; relevance, in turn, depends on the performance measure. Fully observable environments are convenient because the agent need not maintain any internal state to keep track of the world. An environment might be partially observable because of noisy and inaccurate sensors or because parts of the state are simply missing from the sensor data—for example, a vacuum agent with only a local dirt sensor cannot tell whether there is dirt in other squares, and an automated taxi cannot see what other drivers are thinking. If the agent has no sensors at all then the environment is unobservable.

This definition is also the common definition used in reinforcement learning. So, to determine whether an environment is fully or partially observable, you need to determine whether or not you have full access to a (Markovian) state, or what constitutes a state in your case. You can consider chess fully observable because you have access to the configuration of the board, so, in theory, you can take the optimal action by considering all possible moves of the opponent (but, of course, in practice, not even AlphaZero does this). See figure 2.6, p. 45, of the AIMA book (3rd edition) for more examples of fully and partially observable environments.

A partially observable MDP (POMDP) is a mathematical framework that can be used to model partially observable environments, where you maintain a probability distribution over the possible current (or next) state that the agent may be in.
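
As a generic sketch of that framework (names and structure are illustrative, not taken from any particular library), a POMDP can be written down as a plain data container together with the standard belief update $b'(s') \propto O(o \mid s', a) \sum_s T(s' \mid s, a) \, b(s)$.

```python
# A minimal, generic sketch of a POMDP as a plain data container, together
# with the standard belief update over hidden states.
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Sequence

@dataclass
class POMDP:
    states: Sequence[Hashable]        # S
    actions: Sequence[Hashable]       # A
    observations: Sequence[Hashable]  # Omega
    transition: Callable[[Hashable, Hashable, Hashable], float]      # T(s, a, s') = P(s' | s, a)
    observation_fn: Callable[[Hashable, Hashable, Hashable], float]  # O(s', a, o) = P(o | s', a)
    reward: Callable[[Hashable, Hashable], float]                    # R(s, a)
    gamma: float = 0.99               # discount factor

def belief_update(m: POMDP, belief: Dict[Hashable, float],
                  action: Hashable, obs: Hashable) -> Dict[Hashable, float]:
    """Bayes filter over hidden states: the agent acts on this belief because
    the true state is not directly observable."""
    new_belief = {
        s2: m.observation_fn(s2, action, obs)
            * sum(m.transition(s, action, s2) * b for s, b in belief.items())
        for s2 in m.states
    }
    z = sum(new_belief.values())
    return {s: v / z for s, v in new_belief.items()}
```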

nbro
  • I have provided the alternative view. Your use and definition of "partially observable" is fine IMO. What is incorrect is linking "game of perfect information" with "fully observable" without considering variation in the opponent. What is going on with self-play chess players is slightly more complex: they solve the partially observable nature in a specific way. – Neil Slater May 23 '21 at 12:40
0

A partially observable environment is defined from the agent's perspective: the agent observes the environment only partially. At every time step, the agent takes an action based on this partial observation. Based on the agent's action, the state of the environment changes, but the agent may not be aware of all of the changes.
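
As a small sketch of this idea, using a hypothetical grid-world state (the field names are invented for illustration), the agent's observation is a masked view of the true state, so the environment can change in ways the agent cannot detect from a single observation.

```python
# A minimal sketch of a hypothetical grid-world state: the true state contains
# a field the agent never sees, so its observation is only a masked view.
full_state = {
    "agent_position": (2, 3),
    "visible_tiles": ["empty", "wall", "goal"],
    "hidden_item_position": (7, 1),  # part of the true state, never observed
}

def observe(state: dict) -> dict:
    """The agent's sensors return only part of the true state."""
    return {k: state[k] for k in ("agent_position", "visible_tiles")}

print(observe(full_state))  # the hidden item does not appear in the observation
```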