I would like to use DQN to solve a constrained MDP. The constraint is on the action space: at different time steps until the end of an episode, the set of available actions differs. The possibilities are:
- 0, 1, 2, 3, 4
- 0, 2, 3, 4
- 0, 3, 4
- 0, 4
Does this mean I need to learn 4 different Q-networks, one for each possibility? Also, correct me if I am wrong, but it seems that if I specify an action size of 3, the network automatically assumes the actions are 0, 1, 2, whereas in my case they should be 0, 3, 4. How should I implement this?
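For concreteness, here is a minimal sketch of the kind of thing I have in mind: a single Q-network with one output per action in the full set {0, ..., 4}, where invalid actions are masked out before taking the argmax. The stage-indexed action sets come from my problem above; the function name and structure are hypothetical, and I am not sure this is the right approach:

```python
import numpy as np

# Valid action sets at each stage of the episode (from my problem).
VALID_ACTIONS = {
    0: [0, 1, 2, 3, 4],
    1: [0, 2, 3, 4],
    2: [0, 3, 4],
    3: [0, 4],
}
NUM_ACTIONS = 5  # one Q-value output per action in the full set


def masked_argmax(q_values, valid_actions):
    """Pick the greedy action among the currently valid actions only."""
    mask = np.full(NUM_ACTIONS, -np.inf)
    mask[valid_actions] = 0.0  # valid actions keep their Q-values
    return int(np.argmax(q_values + mask))


# Example: action 1 has the highest Q-value overall, but at stage 2
# only {0, 3, 4} are allowed, so the masked argmax picks action 3.
q = np.array([0.1, 0.9, 0.3, 0.5, 0.2])
print(masked_argmax(q, VALID_ACTIONS[2]))  # -> 3
```

Is something like this the standard way to handle it, rather than training 4 separate networks?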