I am currently training an AlphaZero player for a board game. The implementation of the board game is mine; the MCTS for AlphaZero was taken from elsewhere. Due to the complexity of the game, self-play takes much longer than training.
As you know, AlphaZero has two heads: value and policy. In my loss logging I see that, over time, the value loss decreases quite significantly, while the policy loss only fluctuates around its initial value.
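For reference, the loss I have in mind is the usual AlphaZero combination: cross-entropy between the MCTS-improved probabilities and the policy head, plus MSE on the value head. A minimal sketch (PyTorch just for illustration; names and shapes are placeholders, not my actual code):

```python
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value_pred, target_pi, target_z):
    """Standard AlphaZero loss: cross-entropy between the MCTS visit
    distribution (target_pi) and the policy head, plus MSE on the value head.

    policy_logits: (batch, num_moves) raw logits from the policy head
    value_pred:    (batch,) scalar in [-1, 1] from the value head
    target_pi:     (batch, num_moves) improved probabilities from MCTS
    target_z:      (batch,) game outcome from the current player's perspective
    """
    log_probs = F.log_softmax(policy_logits, dim=1)
    policy_loss = -(target_pi * log_probs).sum(dim=1).mean()
    value_loss = F.mse_loss(value_pred, target_z)
    return policy_loss + value_loss, policy_loss, value_loss
```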
Maybe someone here has run into a similar problem? I would like to know whether it's a problem in my implementation (though the decreasing value loss suggests otherwise) or just a matter of not having enough data.
Also, perhaps importantly, the game has ~17k theoretically possible moves, but at most 80 of them are legal in any single state (think chess: a lot of possible moves, but very few are actually legal at any given time). Furthermore, if MCTS runs only 20 simulations, the improved probability vector (the target for the policy loss) has at most 20 non-zero entries. My idea was that it might be hard for the network to learn such sparse targets; a simplified sketch of what I mean is below.
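To make the sparsity concrete, here is roughly how such a target vector gets built from visit counts, plus one common way of masking illegal moves so the softmax only competes over the ~80 legal ones (illustrative only, not my actual code, and the masking helper is just one option):

```python
import numpy as np

NUM_MOVES = 17000  # size of the full move encoding (stand-in for my ~17k)

def policy_target_from_visits(visit_counts):
    """Turn MCTS visit counts {move_index: visits} into the training target.
    With only 20 simulations, at most 20 entries can be non-zero, so the
    target is extremely sparse relative to the ~17k-dimensional move space."""
    pi = np.zeros(NUM_MOVES, dtype=np.float32)
    for move, visits in visit_counts.items():
        pi[move] = visits
    return pi / pi.sum()

def mask_illegal_logits(logits, legal_moves):
    """Replace illegal-move logits with a large negative value before the
    softmax, so the policy distribution (and hence the cross-entropy) is
    effectively computed only over the legal moves."""
    masked = np.full_like(logits, -1e9)
    masked[legal_moves] = logits[legal_moves]
    return masked
```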
Thank you for any ideas!