If I understood correctly, the AlphaGo Zero network returns two values: a vector of logit probabilities p and a value v.
My question is: does this output vector contain a probability for every possible action in the game? If so: does the network assign a probability of 0 to actions that are not possible in that particular state? If this is true, how does the network know which actions are valid?
If not: then the network would have to output vectors of different sizes depending on the state. Is this even feasible? And again, how would the network know which actions are valid?
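To make the first option concrete, here is a minimal sketch of what I imagine. It is only my assumption, not something I know AlphaGo Zero to do: the network always emits one logit per action, and a legality mask supplied by the game rules (the hypothetical `legal_mask` below, not a network output) forces invalid actions to probability 0 before normalising. The name `masked_softmax` and the toy numbers are mine.

```python
import math


def masked_softmax(logits, legal_mask):
    """Softmax over a fixed-size logit vector, giving illegal actions probability 0.

    logits:     one raw score per action in the full action space
    legal_mask: one boolean per action, True if the action is legal in this state
                (assumed to come from the game rules, not from the network)
    """
    if len(logits) != len(legal_mask):
        raise ValueError("logits and legal_mask must have the same length")
    if not any(legal_mask):
        raise ValueError("at least one action must be legal")

    # Subtract the largest legal logit for numerical stability.
    max_logit = max(l for l, ok in zip(logits, legal_mask) if ok)
    exps = [math.exp(l - max_logit) if ok else 0.0
            for l, ok in zip(logits, legal_mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: 5 possible actions, of which actions 1 and 4 are illegal here.
logits = [2.0, 0.5, -1.0, 0.5, 3.0]
legal_mask = [True, False, True, True, False]
probs = masked_softmax(logits, legal_mask)
print(probs)
```

In this sketch the output always has the same length as the full action space, and the knowledge of which actions are valid lives entirely outside the network. Whether the real system works this way is exactly what I am asking.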
There are related questions, but none of them covers this specifically: 1, 2 and 3.