Questions tagged [alphago-zero]

For questions related to AlphaGo Zero, a version of DeepMind's Go software AlphaGo that uses no data from human games and is stronger than AlphaGo. A generalized version, AlphaZero, beat the 3-day version of AlphaGo Zero 60 games to 40. AlphaGo Zero was introduced in the paper "Mastering the game of Go without human knowledge" (2017) by David Silver et al.

Have a look at the research paper that introduced AlphaGo Zero, "Mastering the game of Go without human knowledge" (2017) by David Silver et al., and at https://en.wikipedia.org/wiki/AlphaGo_Zero.

30 questions
15 votes · 1 answer

Why does the policy network in AlphaZero work?

In AlphaZero, the policy network (or head of the network) maps game states to a distribution of the likelihood of taking each action. This distribution covers all possible actions from that state. How is such a network possible? The possible actions…
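
The mechanism behind the excerpt is simpler than it may look: the head emits one raw score (logit) per action, and the distribution comes from normalising those scores after masking out illegal moves. A minimal sketch of that normalisation, assuming the 362-action layout of 19×19 Go (361 board points plus pass) used in the paper; the masking helper is illustrative, not DeepMind's code:

```python
import numpy as np

def policy_distribution(logits, legal_mask):
    """Normalise raw policy-head scores into a distribution over actions.
    Illegal moves are masked to -inf so they get probability zero.
    Illustrative helper, not DeepMind's code."""
    masked = np.where(legal_mask, logits, -np.inf)
    masked = masked - masked.max()        # shift for numerical stability
    exp = np.exp(masked)
    return exp / exp.sum()

# 362 actions for 19x19 Go: 361 board points plus the pass move.
logits = np.random.randn(362)
legal = np.ones(362, dtype=bool)
pi = policy_distribution(logits, legal)
assert np.isclose(pi.sum(), 1.0)
```

Masked moves get probability exactly zero, so the network itself never has to encode the rules; the search layer supplies the legality mask.
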
11 votes · 1 answer

Why is the merged neural network of AlphaGo Zero more efficient than two separate neural networks?

AlphaGo Zero contains several improvements compared to its predecessors. Architectural details of AlphaGo Zero can be seen in this cheat sheet. One of those improvements is using a single neural network that calculates move probabilities and the…
— Demento
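
For context on the merged design: the separate policy and value networks of the original AlphaGo were combined into one network with a shared trunk and two output heads, so the expensive feature extraction runs once per position and, per the paper, the dual objective also regularises the shared features. A toy PyTorch sketch of that shape, with placeholder layer sizes (the real network is a 20- or 40-block residual tower over 17 input planes):

```python
import torch
import torch.nn as nn

class TwoHeadedNet(nn.Module):
    """Toy sketch of the merged architecture: one shared trunk feeding
    a policy head and a value head. Layer sizes are placeholders; the
    real network is a residual tower of 20 or 40 blocks."""

    def __init__(self, channels=32, board=19):
        super().__init__()
        self.trunk = nn.Sequential(            # shared feature extractor
            nn.Conv2d(17, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        flat = channels * board * board
        self.policy_head = nn.Sequential(      # logits for 361 moves + pass
            nn.Flatten(), nn.Linear(flat, board * board + 1),
        )
        self.value_head = nn.Sequential(       # scalar winner prediction in [-1, 1]
            nn.Flatten(), nn.Linear(flat, 1), nn.Tanh(),
        )

    def forward(self, x):
        features = self.trunk(x)               # computed once, consumed by both heads
        return self.policy_head(features), self.value_head(features)
```
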
7 votes · 3 answers

Would AlphaGo Zero become perfect with enough training time?

Would AlphaGo Zero become theoretically perfect with enough training time? If not, what would be the limiting factor? (By perfect, I mean it always wins the game if possible, even against another perfect opponent.)
6 votes · 0 answers

How is the rollout from the MCTS implemented in both of the AlphaGo Zero and the AlphaZero algorithms?

In the vanilla Monte Carlo tree search (MCTS) implementation, the rollout is usually implemented following a uniform random policy: it takes random actions until the game is finished, and only then is the gathered information backed up. I…
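
For contrast with the question, a uniform-random rollout looks like the sketch below. In the AlphaGo Zero and AlphaZero papers this step is dropped entirely: leaf positions are evaluated by the value head instead of being rolled out. The `state` interface here is hypothetical:

```python
import random

def random_rollout(state):
    """Vanilla-MCTS leaf evaluation: play uniformly random legal moves
    to the end of the game and return the outcome to be backed up.
    The `state` interface (is_terminal, legal_moves, play, outcome)
    is hypothetical, for illustration only."""
    while not state.is_terminal():
        state = state.play(random.choice(state.legal_moves()))
    return state.outcome()
```
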
5 votes · 2 answers

What part of the game is the value network trained to predict a winner on?

The AlphaZero (as well as AlphaGo Zero) papers say they trained the value head of the network by "minimizing the error between the predicted winner and the game winner" throughout its many self-play games. As far as I could tell, further…
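
What the papers do specify is that the value target $z$ is the final winner of the whole game ($\pm 1$), attached to every position sampled from that game, and that it enters the joint loss $l = (z - v)^2 - \pi^\top \log \mathbf{p} + c\lVert\theta\rVert^2$. A sketch of that loss, with `theta_l2` standing in for the weight-norm term:

```python
import numpy as np

def agz_loss(p, v, pi, z, theta_l2, c=1e-4):
    """AlphaGo Zero's joint loss: (z - v)^2 - pi.log(p) + c*||theta||^2.
    z is the final winner (+1/-1) of the finished self-play game,
    attached to every position sampled from it; pi is the MCTS
    visit-count distribution. c = 1e-4 as in the paper; `theta_l2`
    stands in for the squared L2 norm of the network weights."""
    value_loss = (z - v) ** 2              # error between predicted and game winner
    policy_loss = -np.dot(pi, np.log(p))   # cross-entropy against search probabilities
    return value_loss + policy_loss + c * theta_l2
```
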
5 votes · 2 answers

What is the difference between DQN and AlphaGo Zero?

I have already implemented a relatively simple DQN on Pacman. Now I would like to clearly understand the difference between a DQN and the techniques used by AlphaGo Zero/AlphaZero, and I couldn't find a place where the features of both approaches are…
5 votes · 1 answer

What is a "logit probability"?

DeepMind's paper "Mastering the game of Go without human knowledge" states, in the "Neural network architecture" part of its "Methods" section, that the output layer of AlphaGo Zero's policy head is "A fully connected linear layer that outputs a vector of…
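
One common reading, offered as interpretation rather than DeepMind's own wording: the linear layer outputs logits, i.e. unnormalised log-probabilities, which become actual move probabilities once passed through a softmax. A tiny sketch:

```python
import numpy as np

def softmax(logits):
    """Logits are the unnormalised scores a linear layer outputs;
    softmax maps them to a probability distribution."""
    z = logits - logits.max()   # stability shift; leaves the softmax unchanged
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])   # raw linear-layer outputs
print(softmax(logits))                # non-negative, sums to 1
```
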
4 votes · 1 answer

Would it take 1700 years to run AlphaGo Zero in commodity hardware?

From this link, AlphaGo would take millennia to run on regular hardware. They generated 29 million games for the final result, which means it would take me about 1700 years to replicate this. Are these calculations correct?
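
A quick sanity check of what the quoted figures imply, pure arithmetic rather than a benchmark: 29 million games spread over 1700 years is about 47 games per day, i.e. one self-play game every half hour on the assumed hardware.

```python
# Pure arithmetic on the figures quoted in the question.
games = 29_000_000
years = 1_700
per_day = games / (years * 365.25)         # ~46.7 games per day
minutes_per_game = 24 * 60 / per_day       # ~31 minutes per self-play game
print(f"{per_day:.1f} games/day, one game every {minutes_per_game:.0f} min")
```
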
4 votes · 1 answer

How Does AlphaGo Zero Implement Reinforcement Learning?

AlphaGo Zero (https://deepmind.com/blog/alphago-zero-learning-scratch/) has several key components that contribute to its success: a Monte Carlo tree search algorithm that allows it to better search and learn from the state space of Go, a deep…
3 votes · 1 answer

How does policy network learn in AlphaZero?

I'm currently trying to understand how AlphaZero works. There is one thing about the training of AlphaZero's policy head that confuses me. Basically, in AlphaGo Zero's paper (where the major part of the AlphaZero algorithm is explained), a combined…
3 votes · 1 answer

AlphaGo Zero: does $Q(s_t, a)$ dominate $U(s_t, a)$ in difficult game states?

AlphaGo Zero uses a Monte-Carlo Tree Search where the selection phase is governed by $\operatorname*{argmax}\limits_a\left( Q(s_t, a) + U(s_t, a) \right)$, where: the exploitation parameter is $Q(s_t, a) = \displaystyle…
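
For readers without the paper open, the two terms are $Q(s, a) = W(s, a)/N(s, a)$, the mean value of simulations through the edge, and $U(s, a) = c_{\text{puct}}\, P(s, a)\, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}$, so $U$ dominates while an edge is rarely visited and $Q$ takes over as visits accumulate. A sketch of the selection step, with a made-up `node.edges` layout:

```python
import math

def select_action(node, c_puct=1.0):
    """PUCT selection: argmax_a Q(s,a) + U(s,a), with Q = W/N and
    U = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).
    `node.edges` maps actions to (N, W, P) tuples; that layout is
    made up for illustration."""
    total_n = sum(n for n, _, _ in node.edges.values())
    best_action, best_score = None, -math.inf
    for action, (n, w, p) in node.edges.items():
        q = w / n if n > 0 else 0.0                     # exploitation term
        u = c_puct * p * math.sqrt(total_n) / (1 + n)   # exploration term
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action
```
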
3 votes · 1 answer

Why does AlphaGo Zero select move based on exponentiated visit count?

From the AlphaGo Zero paper, AlphaGo Zero uses an exponentiated visit count from the tree search. Why use visit count instead of the mean action value $Q(s, a)$?
— Cash Lo
3 votes · 2 answers

How does the AlphaGo Zero policy decide what move to execute?

I was going through the AlphaGo Zero paper and I was trying to understand everything, but I just can't figure out this one formula: $$ \pi(a \mid s_0) = \frac{N(s_0, a)^{\frac{1}{\tau}}}{\sum_b N(s_0, b)^{\frac{1}{\tau}}} $$ Could someone decode how…
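
The formula is a temperature-weighted normalisation of the root visit counts $N(s_0, a)$: with $\tau = 1$ moves are sampled in proportion to their visit counts, and as $\tau \to 0$ it collapses onto the most-visited move. A small sketch:

```python
import numpy as np

def move_probabilities(visit_counts, tau):
    """pi(a|s0) proportional to N(s0, a)^(1/tau). tau = 1 samples moves
    in proportion to visit counts; tau -> 0 collapses onto the
    most-visited move."""
    scaled = np.asarray(visit_counts, dtype=float) ** (1.0 / tau)
    return scaled / scaled.sum()

counts = [10, 30, 60]
print(move_probabilities(counts, tau=1.0))   # [0.1, 0.3, 0.6]
print(move_probabilities(counts, tau=0.1))   # almost all mass on the 60-visit move
```
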
3 votes · 1 answer

What is the input to AlphaGo's neural network?

I have been reading an article on AlphaGo and one sentence confused me a little bit, because I'm not sure what it exactly means. The article says: AlphaGo Zero only uses the black and white stones from the Go board as its input, whereas previous…
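
What that sentence refers to, per the paper: the raw input is a $19 \times 19 \times 17$ stack of binary feature planes (the current player's stones over the last 8 positions, the opponent's stones over the same 8 positions, and one constant plane for the colour to play), with no hand-crafted Go features. A sketch of assembling that stack, with illustrative names:

```python
import numpy as np

def build_input(own_history, opp_history, black_to_play):
    """Assemble the 17 x 19 x 19 input described in the paper:
    8 binary planes of the current player's stones over the last 8
    positions, 8 planes of the opponent's stones, and one constant
    colour-to-play plane. Argument names are illustrative; each
    history is a list of eight 19x19 0/1 arrays."""
    colour = np.full((19, 19), 1.0 if black_to_play else 0.0)
    return np.stack(list(own_history) + list(opp_history) + [colour])

empty = [np.zeros((19, 19)) for _ in range(8)]
x = build_input(empty, empty, black_to_play=True)
print(x.shape)   # (17, 19, 19)
```
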
3 votes · 1 answer

Why is Monte Carlo used as the tree search algorithm for AlphaGo?

Could an algorithm better than Monte Carlo tree search have been used for AlphaGo? Why didn't the DeepMind team think of choosing another kind of algorithm, rather than spending time on their neural nets?