
From the AlphaGo Zero paper, during MCTS the statistics for each new node are initialized as follows:

$\{N(s_L, a) = 0,\ W(s_L, a) = 0,\ Q(s_L, a) = 0,\ P(s_L, a) = p_a\}$.

The PUCT algorithm for selecting the best child node is $a_t = \arg\max_a \left( Q(s,a) + U(s,a) \right)$, where $U(s,a) = c_{puct} P(s,a) \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s, a)}$.

If we start from scratch with a tree that contains only the root node, and no children have been visited yet, then $Q(s,a) + U(s,a)$ evaluates to 0 for every action $a$ we can take from the root node. Do we then simply sample an action to take uniformly at random?
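
To make this concrete, here is a minimal sketch of that selection rule (the dictionary bookkeeping and names below are my own illustration, not the paper's code):

import math

# Minimal sketch of Q(s,a) + U(s,a) over the edges of a single node s.
# N, Q, P map actions to N(s,a), Q(s,a), P(s,a); names are hypothetical.
def puct_select(actions, N, Q, P, c_puct=1.0):
  sqrt_total = math.sqrt(sum(N[a] for a in actions))
  def score(a):
    return Q[a] + c_puct * P[a] * sqrt_total / (1 + N[a])
  return max(actions, key=score)

# A fresh root: no child has been visited, so sqrt(sum_b N(s,b)) = 0 and
# every action scores exactly 0; max() then just returns the first action.
actions = [0, 1, 2]
N = {a: 0 for a in actions}
Q = {a: 0.0 for a in actions}
P = {0: 0.5, 1: 0.3, 2: 0.2}
print(puct_select(actions, N, Q, P))  # -> 0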

Also, during the expand() step, when we add an unvisited node $s_L$ to the tree, that node's children will not have been visited either, and we run into the same problem: PUCT returns 0 for all actions. Do we do the same uniform sampling here as well?

sb3
  • This is a very good question. Related to this is the backprop step (which influences the $N(s,b)$ statistic). The original paper states that the backprop step updates all time steps $t \le L$, but it doesn't make sense to update $(s_L, a_L)$ because $s_L$ is the leaf node that we just expanded: what is $a_L$? Do we uniformly pick an action at random? Or does the author mean to backprop for all time steps $t < L$ instead? This code (https://github.com/suragnair/alpha-zero-general/blob/master/MCTS.py) does backprop only on $t < L$, not $t \le L$. – user3667125 Dec 30 '20 at 07:51
  • Additionally (to answer your question), the code above adds a small epsilon $\epsilon = 10^{-8}$ to $\sum_b N(s,b)$ so that $U(s,a) \neq 0$, and does the selection proportional to the prior $P(s,a)$; but I cannot verify whether the AlphaZero authors do it this way as well, so I cannot submit that as an answer. – user3667125 Dec 30 '20 at 07:57
  • While I can't verify whether the original paper does this or not, backprop'ing only for $t < L$ makes sense to me. – sb3 Dec 30 '20 at 19:28
  • It's a bit more complicated, because AlphaZero's MCTS algorithm is a modified version of a true MCTS algorithm (AlphaZero doesn't actually use true MCTS, because it doesn't use Monte Carlo simulations to roll out the entire game). The true MCTS algorithm does select a node at random, so it would make sense for AlphaZero to also select an $a_L$ at random if it stays true to MCTS. See https://www.analyticsvidhya.com/blog/2019/01/monte-carlo-tree-search-introduction-algorithm-deepmind-alphago/ for more info on how true MCTS selects the first node. – user3667125 Dec 30 '20 at 23:03
  • I did some more research, and found the answer. See my post below. – user3667125 Dec 30 '20 at 23:58

2 Answers


I looked at the Python pseudocode attached as Data S1 of the Supplementary Materials of the AlphaZero paper. Here are my findings:

  • Contrary to the paper, AlphaZero does not store $\{N(s, a), W(s, a), Q(s, a), P(s, a)\}$ statistics for each edge $(s,a)$. Instead, it stores $\{N(s), W(s), Q(s), P(s)\}$ statistics for each node $s$.
  • When a leaf node $s_L$ is expanded, its visit count, value score, and action policy are immediately written into $\{N(s), W(s), Q(s), P(s)\}$, so $N(s_L)$ is at least $1$. This is why, in the paper, the backprop step updates all time steps $t \le L$ rather than $t < L$: it makes sense to update $s_L$ even though there is no corresponding $a_L$ to pair it with.
  • Therefore, when a new leaf node is expanded, the value $U(s, a)$ of a child of that leaf node will be nonzero, since $\sqrt{\sum_b N(s,b)}$ is actually computed as $N(s_{parent})$ in the code, which is at least 1 (see the sketch after this list).
  • Oddly enough, I think there might be a bug in the pseudocode, because on the very first iteration (starting at the root node) $U(s,a) = 0$ for all child nodes of the root node, since $N(s_{root}) = 0$ at that point. The score of every child is then $0$, and because the authors break ties with Python's max function, which returns the first maximal element, the algorithm simply chooses the first element it finds.
  • After the first iteration, $N(s_{root}) > 0$, so $U(s,a) \neq 0$ and things proceed as normal, since the backprop step will have updated the visit count of the root node. This possible bug/unintuitive behavior therefore only affects the first iteration; it is extremely minor and does not affect the outcome of the MCTS, which is probably why it went unnoticed.
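
Here is a minimal sketch of that per-node bookkeeping (the class and helper below are my own illustration, not the paper's pseudocode):

import math

# Per-node statistics as described above; U uses the parent's visit count in
# the numerator and the child's own visit count in the denominator.
class Node:
  def __init__(self, prior):
    self.prior = prior        # P(s)
    self.visit_count = 0      # N(s)
    self.value_sum = 0.0      # W(s)
    self.children = {}

  def value(self):            # Q(s)
    return self.value_sum / self.visit_count if self.visit_count else 0.0

def ucb(parent, child, c_puct=1.0):
  u = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
  return child.value() + u

# On the very first iteration N(root) == 0, so every child scores 0 and
# Python's max() falls back to the first child (the fourth bullet above).
# After one backup, N(root) >= 1 and the priors start to matter.
root = Node(prior=1.0)
root.children = {a: Node(p) for a, p in {0: 0.5, 1: 0.3, 2: 0.2}.items()}
best = max(root.children, key=lambda a: ucb(root, root.children[a]))
print(best)  # -> 0 on the first iteration
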
user3667125
  • I'm good on everything except the third bullet point, which seems very counterintuitive and not how the authors described it in the paper. It sounds like you're saying that every time we run the PUCT algorithm, $\sqrt{\sum_b N(s, b)}$ uses the visit count of the parent node $s_{L-1}$ and in the denominator, $1 + N(s, a)$ is the visit count of the new leaf node $s_L$? I thought the whole point of the $U$ term in PUCT was to encourage exploration by selecting nodes that have been visited less, in which case the numerator should be counting the total number of visits to children of $s_{L-1}$. – sb3 Dec 31 '20 at 20:24
  • Yes, $\sqrt{\sum_b N(s,b)}$ is actually just the visit count of the parent, and $N(s,a)$ is just the visit count of the current node. If the current selection is looking at a leaf node, then yes the parent would be $s_{L-1}$ and current node would be $s_L$. I was surprised as well because this differs from the equation in the paper, but it's how it is coded. – user3667125 Dec 31 '20 at 20:26
  • To me, this defeats the purpose and intuition behind the PUCT algorithm. How are we encouraging exploration with the $U$ term if in the numerator, we are counting the visits to the parent node $s_{L-1}$? Or maybe it's only the denominator that matters for exploration? – sb3 Dec 31 '20 at 20:28
  • The denominator is what encourages exploration. All the sibling nodes have the same numerator (since they have the same parent), but the ones that are less visited have a smaller denominator, so the overall fraction is larger. – user3667125 Dec 31 '20 at 20:29
  • I guess it still does what it was intended to do, albeit not quite the way the algorithm is described in the paper. I wish this was clarified but, oh well. – sb3 Dec 31 '20 at 20:31
  • Follow up question. When starting with just the root node, since there is no parent, how will the denominator of the $U$ term be calculated? Is this where adding an epsilon comes into play? EDIT: Nvm, I think you answered this in the 4th bullet point. – sb3 Dec 31 '20 at 20:40
  • MCTS never does the selection phase on the root node. The root node is expanded (i.e. $(v, \pi_{s_0})$ is computed) upon initialization. So on the first iteration, MCTS begins by computing the selection phase over the children of the root node. So it starts with $s_1$, and we can compute $\sqrt{\sum_b N(s,b)}$ and $N(s,a)$ normally from there. – user3667125 Dec 31 '20 at 20:46
  • Last question. When self-play returns a tuple $(s_t, \pi_t, R_t)$ for the learn() step, I believe the stochastic policy $\pi_t$ that is returned is $\frac{N(s, \cdot)}{\sum_b N(s, b)}$. Based on what we discussed above, is it safe to interpret this as the visit count of some arbitrary child node divided by the visit count of the parent node? – sb3 Dec 31 '20 at 22:37
  • By the way, I just realized that the sum of the visits to all child nodes of some parent node is inherently equivalent to the visit count of that parent node. So I think the algorithm is implemented correctly after all! – sb3 Dec 31 '20 at 22:41
  • Yes to your question. And regarding the last comment -- $N(s_{parent}) = \sum_i N(s_{child_i}) + 1$. This is because when the parent node is just expanded (the second bullet in my answer), the backprop step updates the visit count of that parent node, so $N(s_{parent})$ starts at value $1$. Then the visit count is incremented at each child visit. So that's why it always has a $+1$ in there. The only exception to this is the root node, which doesn't have the $+1$ term, which is why in my fourth bullet I think it's a bug. – user3667125 Dec 31 '20 at 23:18
  • @user3667125 I don't think it's a bug: when the Root Node has a visit count of 0, you aren't even going to loop over the children anyway; the Root Node is a leaf, so the recursion ends right there. You first expand the Root Node (calling the NN on the Root Node for the first time, to give it a policy and initial Q). Then the visit count will be 1, and on the next iteration you'll be able to select a child properly. – Nicholas Pipitone Feb 08 '22 at 12:43

I had the same question and found this page, then I also started to look at the pseudocode of the Supplementary Materials. The code that computes the UCB score is as follows:

# The score for a node is based on its value, plus an exploration bonus based on
# the prior.
def ucb_score(config: AlphaZeroConfig, parent: Node, child: Node):
  pb_c = math.log((parent.visit_count + config.pb_c_base + 1) /
                  config.pb_c_base) + config.pb_c_init
  pb_c *= math.sqrt(parent.visit_count) / (child.visit_count + 1)

  prior_score = pb_c * child.prior
  value_score = child.value()
  return prior_score + value_score

I was very confused, since it is different from the formula described in the paper, until I found the UCB formula in the MuZero paper:

$a^k = \arg\max_a \left[ Q(s,a) + P(s,a) \cdot \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \left( c_1 + \log\left( \frac{\sum_b N(s,b) + c_2 + 1}{c_2} \right) \right) \right]$

It is clear that the AlphaZero pseudocode mistakenly uses the UCB formula from MuZero.
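
As a quick sanity check (my own snippet, not from either paper's code), the pb_c term in ucb_score above matches the MuZero exploration rate with c1 = pb_c_init = 1.25 and c2 = pb_c_base = 19652, the constants in the pseudocode's config:

import math

# pb_c as computed in ucb_score above, with the config constants from the
# pseudocode (pb_c_base = 19652, pb_c_init = 1.25).
def pb_c(parent_visits, pb_c_base=19652, pb_c_init=1.25):
  return math.log((parent_visits + pb_c_base + 1) / pb_c_base) + pb_c_init

# The exploration rate c(s) from the MuZero formula above.
def muzero_c(parent_visits, c1=1.25, c2=19652):
  return c1 + math.log((parent_visits + c2 + 1) / c2)

for n in (0, 1, 100, 10000):
  assert abs(pb_c(n) - muzero_c(n)) < 1e-12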

As for the original question, the sum of the children's visit counts is implemented as the visit count of the parent node, which is at least 1, since a node is visited as soon as it is expanded (evaluated), as you can see in the search code:

# Core Monte Carlo Tree Search algorithm.
# To decide on an action, we run N simulations, always starting at the root of
# the search tree and traversing the tree according to the UCB formula until we
# reach a leaf node.
def run_mcts(config: AlphaZeroConfig, game: Game, network: Network):
  root = Node(0)
  evaluate(root, game, network)
  add_exploration_noise(config, root)

  for _ in range(config.num_simulations):
    node = root
    scratch_game = game.clone()
    search_path = [node]

    while node.expanded():
      action, node = select_child(config, node)
      scratch_game.apply(action)
      search_path.append(node)

    value = evaluate(node, scratch_game, network)
    backpropagate(search_path, value, scratch_game.to_play())
  return select_action(config, game, root), root
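
For completeness, the backpropagate step called above increments the visit count of every node on the search path, including the freshly expanded leaf. Here is a sketch of it (paraphrased, so check Data S1 for the exact body):

# Increment N(s) and accumulate W(s) for every node on the path from the
# root to the expanded leaf, flipping the value for the opponent's nodes.
def backpropagate(search_path, value, to_play):
  for node in search_path:
    node.value_sum += value if node.to_play == to_play else (1 - value)
    node.visit_count += 1
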
leodeng