
I am trying to understand how AlphaZero works, but there is one point that I have trouble understanding, even after reading several different explanations. As I understand it (see for example https://applied-data.science/static/main/res/alpha_go_zero_cheat_sheet.png), AlphaZero does not perform rollouts. So instead of playing a game out to the end, the search stops when it hits an unknown state, uses the neural network to compute probabilities for the different actions as well as the value of that state (a "probability of winning"), and then propagates the new value up the tree.
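To make what I mean concrete, here is a minimal sketch of that leaf-evaluation step as I understand it (the names Node and network.predict are just my own placeholders, not the real implementation):

class Node:
  def __init__(self, prior=0.0):
    self.prior = prior
    self.visit_count = 0
    self.value_sum = 0.0
    self.children = {}  # action -> child Node

def expand_and_backup(leaf, search_path, state, network):
  # Instead of rolling the game out to the end, ask the network for a
  # move distribution and a value estimate for this state.
  priors, value = network.predict(state)  # hypothetical API

  # Expand the leaf: one child per legal action, storing its prior.
  for action, p in priors.items():
    leaf.children[action] = Node(prior=p)

  # Propagate the value estimate back up the searched path, flipping the
  # sign at each step because the players alternate.
  for node in reversed(search_path):
    node.visit_count += 1
    node.value_sum += value
    value = -value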

The reasoning is that this is much cheaper, since actually completing the game would take more time than just letting the neural network guess the value of a state.

However, this requires that the neural network is decent at predicting the value of a state. But at the beginning of training, it will obviously be bad at this. Moreover, since the Monte Carlo tree search stops as soon as it hits a new state, and the number of different game states is very large, it seems to me that the simulation will rarely manage to complete a game. And surely the neural network cannot improve unless it actually completes a significant number of games, because that is the only real feedback that tells the agent whether it is making good or bad moves.

What am I missing here?

The only plausible explanation I can come up with is this: if the neural network were essentially random in the beginning, then the large number of game states would surely prevent the tree search from ever finishing, given that it restarts as soon as it hits a previously unknown game state, so that cannot be the case. So perhaps, even if the neural network is bad in the beginning, it is not very "random" but still quite biased towards some paths. This would mean that the search is biased towards a smaller set of states among the vast number of possible game states, and thus it would tend to take the same path more than once, complete some games, and get feedback. Is this "resolution" correct?

One problem I have with the above "resolution", though, is that according to the algorithm, it should favor exploration in the beginning, so it seems that early on it will be biased towards choosing previously untaken actions. This makes it seem even more like the tree search will never be able to complete a game, and thus the neural net would not learn.

2 Answers


You're right that AlphaGo Zero doesn't perform rollouts in its MCTS. It does complete many, many games, though.

Realize that AlphaGo Zero only iterates MCTS 1,600 times before taking an action. The next state resulting from that action becomes the root of future search trees. Since a typical game of Go only lasts a few hundred moves, the board will very quickly reach a terminal state.
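To make that concrete, here is a rough sketch of the outer loop of one self-play game, written in the style of the official pseudocode (the Game interface used here, terminal(), apply(), and so on, is assumed for illustration, not quoted):

def play_game(config, network):
  game = Game()
  while not game.terminal() and len(game.history) < config.max_moves:
    # 1,600 MCTS simulations, each one stopping at a leaf that the
    # network evaluates -- no rollout to the end of the game in here.
    action, root = run_mcts(config, game, network)
    # One real move is then played on the actual board.
    game.apply(action)
    game.store_search_statistics(root)
  # After at most a few hundred real moves the game is over.
  return game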

None of this is dependent on the initial performance of the neural network. The neural net can be incredibly bad; actions/moves will still be taken at the same frequency. Of course, since AlphaGo Zero trains with self-play, one of its two selves in a game will usually be the winner (ties are possible). So the neural net will improve over time.
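And once a game does finish, every position in it gets a real training target. Roughly (again a sketch, with the interface names assumed for illustration):

def training_examples(game):
  examples = []
  outcome = game.result()  # +1 / 0 / -1 from the first player's point of view
  for i, state in enumerate(game.states):
    # The game outcome, seen from the player to move at position i,
    # becomes the target for the value head.
    z = outcome if i % 2 == 0 else -outcome
    # The normalized MCTS visit counts at that move become the target
    # for the policy head.
    pi = game.search_policy(i)
    examples.append((state, pi, z))
  return examples

This is why the network improves even when it starts out badly: the value target comes from who actually won the finished game, not from the network's own early guesses.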

I recommend going over the "Self-Play Training Pipeline" section of the paper.

Philip Raeisghasem

Sorry, this is more of a comment than an answer. I'm wondering if you have found a definitive answer to your question, because I have a very related question.

I'm also confused by the AlphaZero algorithm; I've described my confusion here: How does AlphaZero use its value and policy heads in conjunction?

The thing is, I think the AlphaZero algorithm is also different from the AlphaGo Zero algorithm, and a lot of the sources I've referred to really mix the two together.

In particular, there's this function in the official pseudocode, which really confused me:

def run_mcts(config: AlphaZeroConfig, game: Game, network: Network):
  root = Node(0)
  evaluate(root, game, network)
  add_exploration_noise(config, root)

  for _ in range(config.num_simulations):
    node = root
    scratch_game = game.clone()
    search_path = [node]

    while node.expanded():
      action, node = select_child(config, node)
      scratch_game.apply(action)
      search_path.append(node)

    value = evaluate(node, scratch_game, network)
    backpropagate(search_path, value, scratch_game.to_play())
  return select_action(config, game, root), root


def select_action(config: AlphaZeroConfig, game: Game, root: Node):
  visit_counts = [(child.visit_count, action)
                  for action, child in root.children.iteritems()]
  if len(game.history) < config.num_sampling_moves:
    _, action = softmax_sample(visit_counts)
  else:
    _, action = max(visit_counts)
  return action

Mainly because, if you look at the definition of expanded...

  def expanded(self):
    return len(self.children) > 0

I.e., the loop only stops descending when a node has no children, which I read as meaning there are no more moves and we are at the end of the game. I wonder what I'm missing here.

aquaplane