
I'm building a really simple experiment, where I let an agent move from the bottom-left corner to the upper-right corner of a $3 \times 3$ grid world.

I plan to use DQN to do this. I'm having trouble handling the starting point: what if the Q network's prediction is telling the agent to move downward (or leftward) at the beginning?

Should I program the environment to immediately give a $-\infty$ reward and end the episode? Will this penalty make the agent "afraid" of moving left again in the future, even when moving left is a possible choice?

Any suggestions?

  • [Here](https://ai.stackexchange.com/q/2980/2444) is a similar question in the context of REINFORCE and [here](https://ai.stackexchange.com/q/3403/2444) is another similar question but in the context of a specific reward function (and DQN). – nbro Nov 14 '20 at 18:06

1 Answer

In a toy environment, this is a choice you can make relatively freely, depending on what you want to achieve with the learning challenge.

It may help to think through what the actual consequences of making the "wrong" move are in your environment. There are a few self-consistent options:

  • The move simply cannot be made and still count as playing the game as intended. In that case, do not allow the agent to make the choice at all. You can achieve this by filtering the list of actions the agent is allowed to choose from. In DQN that means supplying an action mask to the agent, based on the state, so that the invalid action is not even considered at the stage where the agent makes its choice. This "available actions" function is usually coded as part of the environment (see the sketch after this list).

  • The move can be attempted, but results in no change to the state (e.g. the agent bumps into a wall). If the goal is to reach a certain state in the shortest possible time, then you will typically have either a reward of 0 combined with a discount factor, or a small negative reward for each attempted action. Either way, the agent should learn that the move was wasted and avoid it after a few iterations.

  • The move can be attempted, but results in disaster (e.g. the agent falls off a cliff). This is the case where a large negative reward plus ending the episode is appropriate. However, do not use infinite rewards, positive or negative, because they will cause significant problems with numerical stability. A penalty simply large enough to offset any interim positive rewards associated with that direction should be adequate. For a simple goal-seeking environment with no positive reward other than reaching the goal, ending the episode early is already enough.
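
As a minimal illustration of the first option, here is a sketch (in Python) of a $3 \times 3$ grid world that exposes its own "available actions" function; the names `GridWorld`, `valid_actions` and `step` are placeholders assumed for this example, not part of any particular library.

```python
# Minimal sketch of a 3x3 grid world that exposes an "available actions" function.
# All names here (GridWorld, valid_actions, step) are illustrative placeholders.

ACTIONS = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right as (dx, dy)

class GridWorld:
    def __init__(self, size=3):
        self.size = size
        self.pos = (0, 0)                 # start in the bottom-left corner
        self.goal = (size - 1, size - 1)  # goal in the upper-right corner

    def valid_actions(self, pos=None):
        """Return the action ids that keep the agent on the grid from this position."""
        x, y = pos if pos is not None else self.pos
        return [a for a, (dx, dy) in ACTIONS.items()
                if 0 <= x + dx < self.size and 0 <= y + dy < self.size]

    def step(self, action):
        """Apply a valid action; reward 1 for reaching the goal, 0 otherwise."""
        dx, dy = ACTIONS[action]
        self.pos = (self.pos[0] + dx, self.pos[1] + dy)
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done
```

If the agent's action-selection code only ever considers `env.valid_actions()`, the "downward at the start" problem never arises.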

When you don't have a toy environment where you get to decide, then the three basic scenarios above can still help. For instance, in most board games we are not interested in having the agent learn the rules for valid moves when they are already supplied by the environment, so the first scenario applies - only select actions from the valid ones as provided by the environment.

Neil Slater
  • Thanks! In my toy project, the first solution is appropriate. What should I do if I want to add a mask that prevents the agent from moving "downward" and "leftward"? My current solution is to end the episode immediately if the agent takes those actions and to give a fairly large negative reward. I think an action mask is better than my current solution. How can I implement the action mask in the environment? My DQN network takes states as input and outputs Q(s, a). – o_yeah May 20 '20 at 17:23
  • Will it be similar to adding a manual dropout that keeps the output "downward" and "leftward" neurons inactive whenever the agent is in this starting state? – o_yeah May 20 '20 at 17:25
  • @o_yeah: It depends on how you have implemented the neural network. But either way, the filtering is done outside of the neural network, by the learning algorithm when it selects an action. If your NN takes s,a as input in a mini-batch to find the best action, then filter the list of action choices before making the mini-batch. If your NN outputs an array of Q values for every action, write code to somehow ignore values of non-valid actions (e.g. set value to -inf before argmax). In both cases you will need a function defined in the environment that tells you what the valid action choices are. – Neil Slater May 20 '20 at 17:56
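
To make the second case in the comment above concrete (a network that outputs an array of Q-values, one entry per action), a masked greedy selection could look like the sketch below; `q_net` and `valid_actions` are assumed names used for illustration only.

```python
import numpy as np

def select_greedy_action(q_net, state, valid_actions):
    """Greedy action choice restricted to the environment's valid actions.

    q_net(state) is assumed to return a 1-D array of Q(s, a), one entry per action;
    valid_actions is the list of allowed action ids reported by the environment.
    """
    q_values = np.asarray(q_net(state), dtype=float)
    masked = np.full_like(q_values, -np.inf)          # invalid actions get Q = -inf ...
    masked[valid_actions] = q_values[valid_actions]   # ... valid ones keep their values
    return int(np.argmax(masked))
```

The same filtering can also be applied when taking the max over next-state Q-values in the DQN target, so that invalid actions never influence the bootstrapped estimate.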