I am trying to understand how AlphaZero works, but there is one point I have trouble understanding, even after reading several different explanations. As I understand it (see for example https://applied-data.science/static/main/res/alpha_go_zero_cheat_sheet.png), AlphaZero does not perform rollouts. Instead of playing a game out to the end, the search stops when it hits an unknown state, uses the neural network to compute probabilities for the different actions as well as the value of that state ("probability of winning"), and then propagates this new value up the tree.
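To make sure I have this step right, here is a minimal sketch of how I picture the evaluation of an unknown state (the names `Node`, `net.predict` and so on are mine, not from the paper):

```python
class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a): prior probability from the network
        self.visit_count = 0      # N(s, a)
        self.value_sum = 0.0      # W(s, a)
        self.children = {}        # action -> Node

    def value(self):
        # Q(s, a): mean value over all visits so far
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def expand_and_evaluate(node, state, net):
    """Reached a state not yet in the tree: ask the network instead of rolling out."""
    priors, value = net.predict(state)   # priors: {action: probability}, value in [-1, 1]
    for action, p in priors.items():
        node.children[action] = Node(prior=p)
    return value                         # this value is what gets propagated up

def backup(search_path, value):
    """Propagate the leaf value to every node on the path, flipping sign each ply."""
    for node in reversed(search_path):
        node.value_sum += value
        node.visit_count += 1
        value = -value                   # switch to the opponent's perspective
```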
The reasoning is that this is much cheaper, since actually completing the game would take more time than just letting the neural network estimate the value of a state.
However, this requires that the neural network is decent at predicting the value of a state, and at the beginning of training it will obviously be bad at this. Moreover, since the Monte Carlo tree search stops as soon as it hits a new state, and the number of distinct game states is very large, it seems to me that the simulation will rarely manage to complete a game. And surely the neural network cannot improve unless it actually completes a significant number of games, because that is the only real feedback that tells the agent whether it is making good or bad moves.
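By "real feedback" I mean something like the following rough sketch of how a finished self-play game would be turned into training targets (the function and variable names are my own):

```python
def game_to_training_examples(states, mcts_policies, winner):
    """Label every position of a finished self-play game with the final outcome,
    seen from the perspective of the player to move at that position."""
    examples = []
    for ply, (state, pi) in enumerate(zip(states, mcts_policies)):
        player = +1 if ply % 2 == 0 else -1   # assuming player +1 moves on even plies
        z = winner * player                   # +1 if that player won, -1 if they lost, 0 for a draw
        examples.append((state, pi, z))       # the network's value head is trained towards z
    return examples
```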
What am I missing here?
The only plausible explanation I can come up with is this: if the neural network were essentially random in the beginning, then the large number of game states would indeed prevent the tree search from ever finishing, since it restarts as soon as it hits a previously unknown game state; so this cannot be the case. So perhaps, even if the neural network is bad in the beginning, it is not very "random", but still quite biased towards certain paths. That would mean the search is biased towards a smaller subset of the vast number of game states, and thus it would tend to take the same path more than once, complete some games, and get feedback. Is this "resolution" correct?
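To be concrete about what I mean by "not random, but biased": even an untrained network is a fixed function of the state, so it assigns the same (arbitrary) priors every time it sees the same state. A toy illustration of my point (not AlphaZero code):

```python
import random

class UntrainedNet:
    """Stand-in for a freshly initialised network: its outputs are arbitrary,
    but they are identical every time for the same input."""
    def __init__(self, n_actions, seed=0):
        rng = random.Random(seed)
        self.weights = [rng.random() for _ in range(n_actions)]

    def predict_priors(self, state):
        # Mix the fixed weights with a state-dependent but deterministic term.
        h = sum(ord(c) for c in str(state))
        scores = [w + ((h * (a + 1)) % 7) / 10 for a, w in enumerate(self.weights)]
        total = sum(scores)
        return [s / total for s in scores]

net = UntrainedNet(n_actions=4)
print(net.predict_priors("some position"))
print(net.predict_priors("some position"))  # identical: the bias is consistent across visits
```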
One problem I have with the above "resolution", though, is that according to the algorithm the search should favor exploration in the beginning, so early on it seems it will be biased towards choosing previously untaken actions. This makes it seem even more like the tree search will never be able to complete a game, and thus the neural network will not learn.
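For reference, the exploration behaviour I am referring to is the in-tree action selection, which as I understand it looks roughly like the PUCT rule below (building on the `Node` sketch above; `c_puct` and the exact constants are my paraphrase):

```python
import math

def select_action(node, c_puct=1.0):
    """Pick the child maximising Q + U. Rarely visited actions get a large U bonus,
    which is why the search keeps branching into new, unexpanded states early on."""
    total_visits = sum(child.visit_count for child in node.children.values())
    best_action, best_score = None, -float("inf")
    for action, child in node.children.items():
        q = child.value()                                    # exploitation term
        u = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visit_count)  # exploration term
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action
```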