To my understanding, this is basically a supervised learning problem: self-play provides games labelled with their eventual winners, and the network is trained to map game states to the likelihood of winning.
Yes, although the data for this supervised learning problem was generated by self-play. As AlphaZero learned, its evaluations of the same positions needed to change, so this is a non-stationary problem, requiring that the learner forget older training examples over time.
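One simple way to get that forgetting (AlphaGo Zero reportedly trained on a window of the most recent self-play games) is a fixed-capacity buffer that silently drops the oldest positions as new ones arrive. A minimal sketch; the capacity, batch size and label convention here are illustrative:

```python
import random
from collections import deque

class SelfPlayBuffer:
    """Keeps only the most recent self-play positions, so the training
    data always reflects the current strength of the network and stale
    evaluations age out automatically."""

    def __init__(self, capacity=500_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop off the left

    def add_game(self, positions, winner):
        # Label every position in a finished game with the eventual outcome,
        # seen from the side to move: winner and player_to_move are +1/-1
        # (winner is 0 for a draw), giving a target of +1 win / -1 loss / 0 draw.
        for state, player_to_move in positions:
            self.buffer.append((state, winner * player_to_move))

    def sample_batch(self, batch_size=256):
        # Uniform sampling over the recent window for one training step.
        return random.sample(self.buffer, batch_size)
```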
What part of the game is the network trained to predict a winner on?
Potentially all of it, including the starting empty board. I am not sure whether the empty board was evaluated in this way, but it is not only feasible, it can be done exactly for simpler games (Tic Tac Toe and Connect 4, for example), given known player policies.
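As an illustration, in Tic Tac Toe the exact value of any position, the empty board included, can be computed by exhaustive search. The sketch below assumes optimal play on both sides (one particular choice of "known player policies") and confirms that the empty board is a forced draw:

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def value(board, player):
    """Exact value for the side to move: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == player else -1   # previous player just won
    if all(cell is not None for cell in board):
        return 0                          # board full: draw
    opponent = 'O' if player == 'X' else 'X'
    # Negamax: my value is the best of the negated values my opponent
    # faces after each of my legal moves.
    return max(-value(board[:i] + [player] + board[i+1:], opponent)
               for i in range(9) if board[i] is None)

print(value([None] * 9, 'X'))  # -> 0: perfect play from the empty board draws
```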
Obviously after only five moves the winner is not yet clear, and a network trained to predict the game's eventual winner from such an early position would surely learn a meaningless function.
Not at all. This is purely a matter of complexity and difficulty. In practice, at such an early stage the value network will output something non-committal, such as a $p=0.51$ win chance for player 1. It will have learned to do this because, in its self-play experience, similar positions at the start of the game led to player 1 and player 2 winning in almost equal numbers.
Nor is the function meaningless: it can be used to assess the results of look-ahead searches without playing to the end of the game, and it completely replaces the position-evaluation heuristics used in more traditional game tree searches. In practice, very early position data in something as complex as chess or Go will not be as useful as later position evaluations, because the predictions are so equivocal. For consistency, however, it can still be learned and used in the game algorithms.
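As a sketch of that replacement, here is a depth-limited negamax search that consults the value network at the horizon instead of playing lines out to the end. The `state` interface (`is_terminal`, `result`, `legal_moves`, `play`) and `value_net` are hypothetical stand-ins, and note that AlphaZero itself uses Monte Carlo tree search rather than fixed-depth negamax:

```python
def negamax(state, depth, value_net):
    """Depth-limited look-ahead in which the learned value network sits
    exactly where a hand-crafted evaluation heuristic traditionally would."""
    if state.is_terminal():
        return state.result()       # exact outcome: +1 win, 0 draw, -1 loss
    if depth == 0:
        # Search horizon reached: instead of playing the game out,
        # ask the value network to estimate the current position.
        return value_net(state)     # in [-1, 1], from the side to move
    # Otherwise, the best move is the one whose resulting position is
    # worst for the opponent (hence the sign flip on the recursive call).
    return max(-negamax(state.play(move), depth - 1, value_net)
               for move in state.legal_moves())
```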
How is the network trained to understand that, if all it is told is who eventually won?
If a supervised learning technique is given the same input $X$, which across different examples carries the labels $A, B, B, B, A, A, B, B$, then it should learn $p(B \mid X) = 5/8 = 0.625$. That value minimises the cross-entropy loss, and it is exactly what is going on here.
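This is easy to verify numerically: fitting a single probability to those eight labels by gradient descent on the cross-entropy loss converges to the empirical frequency. A small self-contained sketch:

```python
import math

labels = [0, 1, 1, 1, 0, 0, 1, 1]   # A = 0, B = 1: five Bs out of eight

logit = 0.0                         # one learnable parameter, p = sigmoid(logit)
lr = 0.5
for _ in range(2000):
    p = 1.0 / (1.0 + math.exp(-logit))
    # For a sigmoid output, the gradient of the mean cross-entropy
    # with respect to the logit is simply mean(p - y).
    grad = sum(p - y for y in labels) / len(labels)
    logit -= lr * grad

print(round(1.0 / (1.0 + math.exp(-logit)), 3))  # -> 0.625, the empirical p(B|X)
```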