
If I understand correctly, the AlphaGo Zero network returns two outputs: a vector p of move probabilities (produced from logits) and a scalar value v.

My question is: does the output vector contain a probability for every possible action in the game? If so, does the network assign a probability of 0 to actions that are not possible in that particular state? If it does, how does the network know which actions are valid?

If not, then the network would have to output vectors of different sizes depending on the state. Is that even feasible? And again, how would the network know which actions are valid?

Related questions exist, but none of them covers this specific point: 1, 2 and 3.

ihavenoidea
  • Not a duplicate, but more directly answers your question: https://ai.stackexchange.com/questions/2980/how-to-handle-invalid-moves-in-reinforcement-learning I recall answering something that would be a duplicate, but cannot find it. Might be on Data Science instead, perhaps... anyway, the short answer is yes, the NN has every move encoded, and no, the NN never learns "not to play" invalid moves; instead, validity is something that the environment is expected to supply as a callable function that the agent can use to filter choices. – Neil Slater Nov 11 '19 at 08:14
  • Thanks for the link @NeilSlater, it is very helpful! One last question, to see if I understood correctly: so the neural network will return probabilities for all possible actions. After I get the vector from the network, I will have to deal with impossible moves and distribute the probabilities assigned to them to all other valid moves (proportionally). Is that right? – ihavenoidea Nov 11 '19 at 14:15
  • That is one way to do it, yes. If you still need details for how to handle it in your case, then keep the question open because I could not find a full duplicate, so perhaps someone will answer here. – Neil Slater Nov 11 '19 at 15:01
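
As a minimal sketch of the approach described in the comments, assuming a fixed-size action space and NumPy: the network scores every action, the environment supplies a legality mask, and the agent zeroes out illegal moves and renormalizes. The function and variable names (mask_and_renormalize, policy, legal) are illustrative, not from the AlphaGo Zero code.

    import numpy as np

    def mask_and_renormalize(policy: np.ndarray, legal: np.ndarray) -> np.ndarray:
        """Zero the probability of illegal moves and rescale the remaining
        probabilities so they sum to 1 again."""
        masked = np.where(legal, policy, 0.0)
        total = masked.sum()
        if total == 0.0:
            # Degenerate case: the network put all its mass on illegal moves;
            # fall back to a uniform distribution over the legal ones.
            return legal / legal.sum()
        return masked / total

    # Toy example: a 4-action space where action 2 is illegal in this state.
    policy = np.array([0.1, 0.2, 0.4, 0.3])      # network output over ALL actions
    legal = np.array([True, True, False, True])  # supplied by the environment
    print(mask_and_renormalize(policy, legal))   # ≈ [0.167, 0.333, 0., 0.5]

Equivalently, one can set the logits of illegal moves to -inf before the softmax, which yields the same distribution over the legal moves.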
