
I would like to employ DQN to solve a constrained MDP problem. The problem has constraints on the action space: at different time steps until the end, the set of available actions differs. The possibilities are as follows.

  • 0, 1, 2, 3, 4
  • 0, 2, 3, 4
  • 0, 3, 4
  • 0, 4

Does this mean I need to learn 4 different Q networks for these possibilities? Also, correct me if I am wrong: it seems that if I specify an action size of 3, the implementation automatically assumes the actions are 0, 1, 2, but in my case they should be 0, 3, 4. How shall I implement this?

nbro
ycenycute

1 Answer


There are two relevant neural network designs for DQN:

  • Model the Q function directly, $Q(s,a): \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$: the neural network takes a concatenated state and action as input and outputs a single real value. This is arguably the more natural fit to Q-learning, but it can be inefficient.

  • Model all Q values for a given state, $Q(s,\cdot): \mathcal{S} \rightarrow \mathbb{R}^{|\mathcal{A}|}$: the neural network takes the current state as input and outputs the action values for that state as a vector.

For the first architecture, you can decide which actions to evaluate through how you construct the minibatch: pre-filter to the allowed actions for each state.
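As a minimal sketch of the pre-filter approach (the `q_network`, the state shape, and the action schedule here are stand-ins for illustration, not part of the question):

```python
import numpy as np

# Hypothetical action schedule matching the question: which of the 5
# actions are legal at each stage (structure is an assumption).
ACTION_SCHEDULE = {0: [0, 1, 2, 3, 4], 1: [0, 2, 3, 4], 2: [0, 3, 4], 3: [0, 4]}

def greedy_action(q_network, state, allowed):
    """Evaluate Q(s, a) only for the allowed actions, return the best one."""
    # Mini-batch of concatenated [state, one_hot(action)] inputs, one row
    # per allowed action -- disallowed actions are never evaluated.
    batch = np.stack([np.concatenate([state, np.eye(5)[a]]) for a in allowed])
    q_vals = q_network(batch)              # one scalar per allowed action
    return allowed[int(np.argmax(q_vals))]

# Dummy stand-in for a trained network: a fixed linear scorer.
q_network = lambda batch: batch @ np.concatenate([np.zeros(4), [0.1, 0.5, 0.2, 0.9, 0.3]])

state = np.zeros(4)                        # placeholder 4-dim state
print(greedy_action(q_network, state, ACTION_SCHEDULE[2]))  # prints 3
```

Because the network only ever sees (state, action) pairs you chose to construct, no output masking is needed with this design.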

For the second architecture, you must post-filter the action values to those allowed by the state.
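A minimal sketch of the post-filter, assuming the network outputs a vector of 5 Q values (the numbers here are made up):

```python
import numpy as np

# Made-up Q-network output for one state: one value per action 0..4.
q_values = np.array([0.2, 1.5, -0.3, 0.9, 0.4])

allowed = [0, 3, 4]                     # legal actions in this state

# Post-filter: set disallowed actions to -inf so argmax ignores them.
masked = np.full_like(q_values, -np.inf)
masked[allowed] = q_values[allowed]

best_action = int(np.argmax(masked))    # 3, even though action 1 scores higher
```

The same mask should be applied when taking the max over next-state action values in the TD target, so that updates never bootstrap from an illegal action.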

There are other possibilities for handling variable-length inputs and outputs to neural networks, e.g. using RNNs. However, these are normally not worth the extra effort. A pre- or post-filter on the actions, applied to a NN that processes the whole action space (including impossible actions), is usually all you need. Don't worry that the neural network may compute some unneeded or nonsense values.

Neil Slater
  • I am using the second architecture, can you elaborate more on how to post-filter the action values? – ycenycute Jan 10 '21 at 18:35
  • @TracyYang: That will depend on your software library. The point is to ensure that they are not considered when choosing an action to take (or looking at the maximum action value from a state for updates). So for example, if you are using a max or argmax function to select, then set the unwanted values to `-inf` before running it. – Neil Slater Jan 10 '21 at 18:44
  • So, the action size is still 5, but when it should be 0,3,4 only, then set values corresponding to 1 and 2 to `-inf` first before using argmax. Is this correct? – ycenycute Jan 10 '21 at 20:42
  • @TracyYang Yes that is correct – Neil Slater Jan 11 '21 at 07:23