
I am currently trying to train a DQN and a Double DQN (DDQN) agent to play a 10x10 GridWorld game, so I can compare the two as I increase the number of moves the agent can take.

The rewards are as follows:

  • Step: -1
  • Key: 100
  • Door: 100
  • Death wall: -100

See the setup of the AI in the code.

My problem is that regardless of what I do, the agent ends up following the same strategy of just going to the outer walls and staying there. I presume this is because the outer walls give the least amount of punishment per step, as the risk of dying is decreased considerably. At the same time, as the number of allowed moves increases, the chance of ending up at the outer walls increases as well (see the heatmaps; left: 1 move per direction, right: a maximum of 8 moves per direction).

I've tried the following:

  • Drastically slowing the decay of epsilon, such that it only reaches its final value in the last 10% of the training steps (sketched after this list).
  • Running 100k moves just to add to the EMR before I start actually counting the steps.
  • Increasing the size of the network
  • Giving a reward of -2 for staying in the same tile
  • Feeding it the whole grid as the input vector
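
For concreteness, the epsilon schedule in the first bullet looks roughly like this (a simplified sketch, not the exact code from my repo; the names are illustrative):

    # Linear epsilon schedule: epsilon only reaches its final value at 90% of
    # the total training steps, so only the last 10% runs at eps_final.
    def epsilon_at(step, total_steps, eps_start=1.0, eps_final=0.05):
        decay_end = int(0.9 * total_steps)
        if step >= decay_end:
            return eps_final
        frac = step / decay_end  # fraction of the decay phase completed
        return eps_start + frac * (eps_final - eps_start)

    # Example: with 14 million training steps, epsilon hits 0.05 only at step 12.6M.
    print(epsilon_at(7_000_000, 14_000_000))  # ~0.47, still exploring heavily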

None of this has worked. The longest I have trained a single model was for 14 million training steps. Still, same strategy as before.

The way I evaluate the model is (in this order):

  • Every 1 million training steps, running 50-100k evaluation steps and recording the outcome of every step
  • Generating a heatmap to see whether or not the agent remains in the same few places (which are not the key or the door); a simplified sketch of this appears after the list
  • Running its best policy and visually estimating whether or not it has improved
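
In simplified form (this is not the exact code in HeatMapEvaluation.py, just an illustration assuming the agent's (x, y) positions are recorded during evaluation), the heatmap is a visit count per tile:

    import numpy as np

    def build_heatmap(positions, size=10):
        # Count how often each (x, y) tile was visited during evaluation.
        heatmap = np.zeros((size, size), dtype=np.int64)
        for x, y in positions:
            heatmap[y, x] += 1
        return heatmap

    # Hypothetical example: positions recorded during one evaluation run.
    visits = [(0, 0), (0, 1), (0, 1), (9, 9)]
    print(build_heatmap(visits))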

CODE: https://github.com/BrugerX/IISProject3Ugers

Training is done through the TrainingLoop.py and the evaluation is done through the HeatMapEvaluation.py.

What is it that I have missed? Why, even after 14 million training steps, has the model still not learnt to memorize the path through the GridWorld?

1 Answer


I have two suggestions that you can look into. Based on my own work in RL, I believe the first one will require less work to implement.

  1. If the observability of the environment is not an issue, then you could give the agent a relative measure (the distance to the goal) as part of the observation, so that it knows how far away it is. You can also incorporate this into the reward function to put further emphasis on minimising the distance to the goal as the key objective. In a GridWorld you would use the Manhattan distance as this relative measure (see the first sketch below).

  2. You can apply curriculum learning to make the environment less ambiguous during early training. The principle of this strategy is to provide easier training examples that do not require many random actions to reach the goal. As the agent learns, the complexity of the environment is increased and the agent can transfer knowledge from the easier tasks. In practice, this can be done in two ways:

    2.1. You start with a small environment, say 3x3 or 5x5, and increase the size after the success rate reaches some satisfying number.

    2.2. When resetting the environment, you spawn the agent in close proximity to the goal and increase the distance after the success rate reaches some satisfying number (see the second sketch below).

The second method is inspired by this paper. I can also highly recommend this blog post from Lilian Weng if you would like to learn more about curriculum learning.
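
Here is a minimal sketch of the first suggestion, assuming the environment exposes the agent's and the goal's grid coordinates (the names agent_pos, goal_pos and the shaping weight are illustrative, not taken from your code):

    def manhattan(a, b):
        # Manhattan (L1) distance between two grid coordinates.
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def augment_observation(obs, agent_pos, goal_pos):
        # Append the distance to the current goal (key first, then door) to a list observation.
        return obs + [manhattan(agent_pos, goal_pos)]

    def shaped_reward(base_reward, prev_pos, new_pos, goal_pos, weight=0.5):
        # Optional shaping: reward moving closer to the goal, penalise moving away.
        delta = manhattan(prev_pos, goal_pos) - manhattan(new_pos, goal_pos)
        return base_reward + weight * delta

Shaping with the change in distance, rather than the raw distance, leaves the step penalty intact while still nudging the agent toward the key and the door.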
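
And a sketch of variant 2.2; the reset interface here is hypothetical and would need adapting to your GridWorld implementation:

    import random

    class SpawnCurriculum:
        # Start the agent close to the goal and widen the spawn radius as it succeeds.

        def __init__(self, start_radius=1, max_radius=18, success_threshold=0.9):
            self.radius = start_radius
            self.max_radius = max_radius  # 18 is the maximum Manhattan distance on a 10x10 grid
            self.success_threshold = success_threshold

        def sample_spawn(self, goal_pos, free_tiles):
            # Pick a free tile within the current radius of the goal.
            candidates = [t for t in free_tiles
                          if abs(t[0] - goal_pos[0]) + abs(t[1] - goal_pos[1]) <= self.radius]
            return random.choice(candidates)

        def update(self, success_rate):
            # Widen the radius once the agent solves the current difficulty reliably.
            if success_rate >= self.success_threshold and self.radius < self.max_radius:
                self.radius += 1

You would call update() with the success rate measured over recent evaluation episodes, and sample_spawn() in the environment's reset.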

Lars