I've implemented this exact scenario before; your approach would most likely be successful, but I think it could be simplified.
> Therefore, when deciding which action to pick, the agent sets Q-values to 0 for all the illegal moves while normalizing the values of the rest.
In DQN, the Q-values are used to find the best action. To determine the best action in a given state, it suffices to look at the Q-values of all valid actions and take the valid action with the highest Q-value. Setting the Q-values of invalid actions to 0 is unnecessary once you have a list of valid actions. Note that you would need that set of valid actions anyway in order to know which Q-values to set to 0, so the approach I'm suggesting is more concise without hurting performance.
Since only the relative order of the Q-values matters for finding the best action, there is no need for normalization either. Also, the original DQN paper uses $\epsilon$-greedy exploration; when exploring this way, make sure to sample only from the valid actions in the given state.
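As a concrete illustration, here is a minimal sketch of such an action-selection routine. It assumes you can produce a list of valid action indices for the current state; the function and parameter names are just placeholders, not part of any particular library.

```python
import random

def select_action(q_values, valid_actions, epsilon):
    """Epsilon-greedy action selection restricted to valid actions.

    q_values:      sequence of Q-values for every action in the full action space
    valid_actions: list of indices of the actions that are legal in the current state
    epsilon:       exploration probability
    """
    if random.random() < epsilon:
        # Explore: sample uniformly from the valid actions only.
        return random.choice(valid_actions)
    # Exploit: take the valid action with the highest Q-value.
    # No masking or normalization of the other Q-values is needed.
    return max(valid_actions, key=lambda a: q_values[a])
```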
> During training, when the agent is calculating the loss between policy and target networks, should the illegal actions be ignored (set to 0) so that they don't affect the calculations?
As noted in one of your previous questions, we train on tuples of experiences $(s, a, r, s')$. The Q-learning update is defined as follows (Equation 6.8 of Sutton and Barto):
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[R_{t+1} + \gamma\max\limits_aQ(S_{t+1}, a) - Q(S_t, A_t)\right].$$
The update requires taking a maximum over all valid actions in $s'$. Again, setting invalid Q-values to 0 is unnecessary extra work once you know the set of valid actions: ignoring invalid actions is equivalent to simply leaving them out of the set you maximize over.
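For the batched case, a minimal PyTorch-style sketch of the target computation might look like the following. It assumes you store a boolean validity mask for each next state (the name `next_valid_mask` and the function itself are placeholders); filling invalid entries with $-\infty$ is just a vectorized way of leaving those actions out of the maximum.

```python
import torch

def td_targets(rewards, next_q_values, next_valid_mask, dones, gamma=0.99):
    """Compute targets r + gamma * max_{a'} Q(s', a') over valid actions only.

    rewards:         (batch,) float tensor of rewards
    next_q_values:   (batch, num_actions) float tensor from the target network
    next_valid_mask: (batch, num_actions) bool tensor, True where an action is valid in s'
    dones:           (batch,) bool tensor marking terminal transitions
    """
    # Fill invalid actions with -inf so they can never win the max;
    # this is equivalent to maximizing over the set of valid actions only.
    masked_q = next_q_values.masked_fill(~next_valid_mask, float("-inf"))
    max_next_q = masked_q.max(dim=1).values
    # No bootstrapping from terminal next states.
    max_next_q = torch.where(dones, torch.zeros_like(max_next_q), max_next_q)
    return rewards + gamma * max_next_q
```

The online network's estimate $Q(s, a)$ for the action that was actually taken is then regressed toward these targets, so invalid actions never enter the loss at all: only the chosen action's Q-value and the masked maximum over the next state's valid actions appear in it.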