4

I am currently working on training an AlphaZero player for a board game. The board game implementation is mine; the MCTS for AlphaZero was taken from elsewhere. Due to the complexity of the game, self-play takes much longer than training.

As you know, AlphaZero has two heads: value and policy. In my loss logging, I see that the value loss decreases quite significantly over time. However, the policy loss only fluctuates around its initial value.

Has anyone here run into a similar problem? I would like to know whether this is a problem with my implementation (but then again, the value loss is decreasing) or just a matter of not having enough data.

Also, perhaps importantly, the game has ~17k theoretically possible moves, but at most 80 are legal in any single state (think chess: many moves are theoretically possible, but very few are legal at any given time). Furthermore, if MCTS runs only 20 simulations, then the improved probabilities vector (against which we train the policy loss) will have at most 20 non-zero entries. My thought was that it might be hard for the network to learn such sparse target vectors.
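
To make the sparsity concrete, here is a minimal numpy sketch (illustrative only, not my actual implementation; the sizes, helper name, and temperature parameter are just placeholders) of how the policy target is typically built from the root visit counts:

```python
# Minimal sketch: building the policy target from MCTS visit counts.
# With ~17k actions and only 20 simulations, at most 20 entries can be non-zero.
import numpy as np

NUM_ACTIONS = 17_000      # assumed size of the full action space
NUM_SIMULATIONS = 20      # assumed MCTS budget per move

def policy_target_from_visits(visit_counts, temperature=1.0):
    """Turn {action_index: visit_count} at the MCTS root into a
    probability vector over the full action space."""
    target = np.zeros(NUM_ACTIONS, dtype=np.float32)
    for action, count in visit_counts.items():
        target[action] = count ** (1.0 / temperature)
    target /= target.sum()
    return target

# Example: 20 simulations spread over only 3 of the ~80 legal moves.
visits = {412: 12, 4031: 5, 16999: 3}
pi = policy_target_from_visits(visits)
print(np.count_nonzero(pi))   # -> 3 non-zero entries out of 17,000
```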

Thank you for any ideas!

ytolochko
  • Loss is tricky in RL. A non-decreasing loss could be caused by an increased rate of exploration, and that would be a good thing. The AlphaZero class of algorithms converges slowly anyway, and it seems you have a huge action space, which is not helping. It may take a lot of time/oscillation until you see improvement. – mirror2image Apr 24 '19 at 14:40
  • Just out of interest, what game are you training it on? (Also, any update on how it went? :) – DukeZhou Nov 08 '19 at 22:17

1 Answer

2

The loss of the policy head here is really quite different from losses in, for instance, more "conventional" Supervised Learning approaches (where we typically expect/hope to see a relatively steady decrease in the loss function).

In this AlphaZero setup, the target that we're updating the policy head towards is itself changing during the training process. When we improve our policy, we expect the MCTS "expert" to also improve, which may lead to a different distribution of MCTS visit counts, which in turn may lead to a different update target for the policy head than before. So it's perfectly fine if our "loss" sometimes increases; we may still actually be performing better. The loss is useful for the computation of our gradient, but otherwise it doesn't have much use -- it certainly cannot be used as an accurate indicator of performance / learning progress.
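
To illustrate the moving-target point, here is a small numpy sketch (my own illustration, with made-up numbers rather than anything from your setup) of the policy loss as a cross-entropy against the MCTS visit distribution; for the same network output, a later (and arguably better) MCTS target can produce a larger loss value:

```python
# The AlphaZero policy loss is the cross-entropy between the network's move
# distribution p and the MCTS visit distribution pi. Because pi is produced by
# MCTS *using the current network*, the target itself shifts as training
# progresses, so this loss need not decrease even when play is improving.
import numpy as np

def policy_loss(logits, pi):
    """Cross-entropy H(pi, p) with p = softmax(logits)."""
    logits = logits - logits.max()                 # numerical stability
    log_p = logits - np.log(np.exp(logits).sum())  # log-softmax
    return float(-(pi * log_p).sum())

# Same network output, two different MCTS targets from different stages of
# training: the later target yields a *larger* loss value here.
logits = np.array([2.0, 0.5, 0.1, -1.0])
pi_early = np.array([0.70, 0.20, 0.10, 0.00])
pi_later = np.array([0.10, 0.10, 0.75, 0.05])
print(policy_loss(logits, pi_early), policy_loss(logits, pi_later))
```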

"...but at most 80 are legal in any single state (think chess: many moves are theoretically possible, but very few are legal at any given time). Furthermore, if MCTS runs only 20 simulations, then the improved probabilities vector (against which we train the policy loss) will have at most 20 non-zero entries."

Yes, this can be a problem. The fact that the majority of moves are not legal at any point in time is not a problem, but if you only have 20 MCTS simulations for a branching factor of 80... that is certainly a problem. The easiest fix would be to simply keep MCTS running for longer, but obviously that's going to take more computation time. If you cannot afford to do this for every turn of self-play, you could try:

  • using only a low MCTS iteration count for some moves, not adding these distributions to the training data for the policy head
  • using a larger MCTS iteration count for some other moves, and only using the distributions of these moves as training data for the policy head

This idea is described in more detail in Subsection 6.1 of Accelerating Self-Play Learning in Go.
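
For what it's worth, here is a rough Python sketch of what such a mixed self-play loop could look like. The search budgets, probabilities, and helpers (`run_mcts`, `sample_move`, the `game` interface) are hypothetical placeholders rather than any particular library's API, and how you handle value targets for fast-search moves is a separate design choice:

```python
# Rough sketch of the mixed fast/full search idea (in the spirit of
# Subsection 6.1 of the linked paper). All helper functions are hypothetical.
import random

FULL_SIMULATIONS = 600   # assumed budget for "full" searches
FAST_SIMULATIONS = 50    # assumed budget for "fast" searches
FULL_SEARCH_PROB = 0.25  # assumed fraction of moves that get a full search

def self_play_game(game, network):
    policy_examples = []   # (state, visit_distribution) pairs for the policy head
    value_states = []      # states; value targets get filled in from the outcome
    while not game.is_terminal():
        full_search = random.random() < FULL_SEARCH_PROB
        num_sims = FULL_SIMULATIONS if full_search else FAST_SIMULATIONS
        visit_dist = run_mcts(game, network, num_simulations=num_sims)
        if full_search:
            # Only well-searched moves produce policy training targets.
            policy_examples.append((game.state(), visit_dist))
        value_states.append(game.state())
        game.play(sample_move(visit_dist))
    outcome = game.result()
    value_examples = [(s, outcome) for s in value_states]
    return policy_examples, value_examples
```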

Dennis Soemers