
I'm working on reimplementing the MuZero paper. In the description of the MCTS (page 12), they indicate that a new node with associated state $s$ is to be initialized with $Q(s,a) = 0$, $N(s,a) = 0$ and $P(s,a) = p_a$. From this, I understand that the root node with state $s_0$ will have edges with zero visits each, zero value and policy evaluated on $s_0$ by the prediction network.

So far so good. Then they explain how actions are selected, according to the equation (also on page 12):

$$a^k = \arg\max_a \left[ Q(s,a) + P(s,a) \cdot \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \left( c_1 + \log\left(\frac{\sum_b N(s,b) + c_2 + 1}{c_2}\right) \right) \right]$$

But for the very first action (from the root node) this gives a vector of zeros as the argument to the argmax: $Q(s_0,a) = 0$ and $\sum_b N(s_0,b) = 0$, so even though $P(s_0,a)$ is nonzero, it is multiplied by a zero weight.
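
To make the degenerate case concrete, here is a minimal sketch of the selection score (assuming the constants $c_1 = 1.25$ and $c_2 = 19652$ from the paper); it just shows that at a fresh root every action scores zero, regardless of the prior:

```python
import math

# Sketch of MuZero's pUCT-style score, with the paper's constants.
C1, C2 = 1.25, 19652

def puct_score(q, n, prior, total_visits):
    """Score for one edge: Q(s,a) plus the prior-weighted exploration term."""
    exploration = prior * math.sqrt(total_visits) / (1 + n) * (
        C1 + math.log((total_visits + C2 + 1) / C2)
    )
    return q + exploration

# Fresh root: Q = 0, N = 0 for every action, priors from the prediction network.
priors = [0.7, 0.2, 0.1]
total = 0  # sum_b N(s0, b) is zero before the first simulation
scores = [puct_score(0.0, 0, p, total) for p in priors]
print(scores)  # [0.0, 0.0, 0.0] -- the prior is multiplied by sqrt(0) = 0
```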

Surely there is a mistake somewhere? Or is it that the very first action is uniformly random?

Ziofil

1 Answer


Yeah, it seems that you're right: based on the description in the paper, it would indeed behave uniformly at random at the very first iteration (or maybe just always deterministically pick whichever action happens to come first in the list, depending on how the argmax breaks ties). I can't find anything in the paper that suggests otherwise, and the pseudocode they put on arXiv suggests the same behaviour.

I can't really think of any reason why this would be better than playing according to $P(s, a)$ in the very first iteration, though. Maybe it creates a tiny bit more variety in the tree search results? I suppose all the stochasticity that is normally inherent to a vanilla MCTS is no longer present in MuZero (or AlphaZero), because they always run exactly the same number of iterations and no longer use random rollouts; this would at least introduce a tiny bit of variation.

On the other hand, I also can't imagine this choice really making a meaningful difference in practice, except maybe if you go down to extremely low iteration counts (like if you run fewer iterations than there are legal actions in the root node).
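
If you did want the first expansion from a fresh node to follow the prior, one hypothetical tie-break (not what the paper prescribes, just a sketch of the idea above) could look like this:

```python
import math

C1, C2 = 1.25, 19652  # constants from the MuZero paper

# Hypothetical variant: when a node has no visits yet, every pUCT score is
# zero, so break the tie with the prior instead of relying on argmax's
# arbitrary choice.
def select_action(qs, ns, priors):
    total = sum(ns)
    if total == 0:
        return max(range(len(priors)), key=lambda a: priors[a])
    def score(a):
        explore = priors[a] * math.sqrt(total) / (1 + ns[a]) * (
            C1 + math.log((total + C2 + 1) / C2))
        return qs[a] + explore
    return max(range(len(qs)), key=score)

# Fresh root: falls back to the prior and picks action 0 (prior 0.7).
print(select_action([0.0, 0.0, 0.0], [0, 0, 0], [0.7, 0.2, 0.1]))
```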

Dennis Soemers
  • But as the policy network gets better and better, it would make a bit more of a difference, no? When I think of a game of Go, where there are hundreds of actions stemming off the root node, I wouldn't want to waste time exploring bad ones from the start : / – Ziofil Nov 29 '20 at 14:15
  • @Ziofil I feel that it would probably not matter much in practice, because this is only an issue the first time you ever visit a node, for any node. And good nodes are likely to get many more than just a handful of visits anyway, because the policy network will keep steering the search towards them if the network is good (even from nodes multiple levels higher up in the tree). But again, that's just my intuition; the only way to be sure is to test it. And theoretically, I certainly don't see any *advantage* to doing it the way it is described either – Dennis Soemers Nov 29 '20 at 15:16