When training an RL agent to play a game, there will be situations where the agent cannot perform certain actions without violating the game rules. That's easy to handle at action-selection time: I can set the logits of illegal actions to some large negative value so that an argmax will never select them. Or, if I use a softmax, I can set the probabilities of illegal actions to zero and renormalize over the remaining legal actions. Indeed, I believe this is what David Silver was referring to when asked about this at a presentation/lecture on AlphaZero:
https://www.youtube.com/watch?v=Wujy7OzvdJk&t=2404s
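To make the setup concrete, here is a minimal sketch of the masking I have in mind (NumPy, with hypothetical logits; illegal actions are pushed to -inf before the softmax so they come out with exactly zero probability):

```python
import numpy as np

def masked_softmax(logits, legal_mask):
    """Softmax over legal actions only: illegal logits are set to -inf,
    so after exponentiation they contribute 0 to the normalization."""
    masked = np.where(legal_mask, logits, -np.inf)
    shifted = masked - masked.max()                    # numerical stability
    exps = np.where(legal_mask, np.exp(shifted), 0.0)  # exp(-inf) -> 0
    return exps / exps.sum()

# Hypothetical example: 4 actions, actions 1 and 3 are illegal
logits = np.array([1.0, 2.0, 0.5, 3.0])
legal = np.array([True, False, True, False])
probs = masked_softmax(logits, legal)  # probabilities only over actions 0 and 2
```

My question is about what happens on the backward pass through something like this, since the masked entries no longer match the network's raw output.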
But doing so changes the output of the network, and surely that affects the backprop step once a reward is known.
How does one handle that?
Would I set the outputs for the illegal actions to the mean of the legal actions, or to zero...?