
I am implementing the DQN model from scratch in PyTorch, with Atari Pong as the target environment. After some hyper-parameter tweaking, I cannot seem to get the model to achieve the performance reported in most publications (~ +21 reward, meaning the agent wins almost every volley).

My most recent results are shown in the figure below. Note that the x axis is episodes (full games to 21), but the total number of training iterations is ~6.7 million.

[Figure: average reward per episode over training, plateauing around 6-9]

The specifics of my setup are as follows:

Model

import torch.nn as nn
import torch.nn.functional as F

class DQN(nn.Module):
    def __init__(self, in_channels, outputs):
        super(DQN, self).__init__()
        # Convolutional stack from the Nature DQN architecture (Mnih et al., 2015)
        self.conv1 = nn.Conv2d(in_channels=in_channels, out_channels=32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1)
        # An 84x84 input becomes a 7x7x64 feature map after the three conv layers
        self.fc1 = nn.Linear(in_features=64*7*7, out_features=512)
        self.fc2 = nn.Linear(in_features=512, out_features=outputs)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(-1, 64 * 7 * 7)    # flatten conv features for the fully connected layers
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x    # return Q values of each action
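
For reference, a quick shape sanity check for this network (a minimal sketch; the 4-frame stack, 84x84 input size, and Pong's 6-action space come from the setup described below, and the batch of zeros is purely illustrative):

import torch

policy_net = DQN(in_channels=4, outputs=6)    # 4 stacked frames in, 6 Q-values out (Pong's action space)
dummy_batch = torch.zeros(1, 4, 84, 84)       # (batch, channels, height, width)
q_values = policy_net(dummy_batch)
print(q_values.shape)                         # torch.Size([1, 6])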

Hyperparameters

  • batch size: 32
  • replay memory size: 100000
  • initial epsilon: 1.0
  • epsilon anneals linearly to 0.02 over the first 100000 steps (see the sketch after this list)
  • random warm-start steps: ~50000
  • update target model every: 1000 steps
  • optimizer = optim.RMSprop(policy_net.parameters(), lr=0.0025, alpha=0.9, eps=1e-02, momentum=0.0)
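
A minimal sketch of the linear epsilon schedule from the list above (the helper name and the explicit step argument are my own illustration, not the author's code):

EPS_START = 1.0
EPS_END = 0.02
EPS_DECAY_STEPS = 100000

def epsilon_by_step(step):
    # Linearly anneal from EPS_START down to EPS_END over EPS_DECAY_STEPS, then hold at EPS_END
    fraction = min(step / EPS_DECAY_STEPS, 1.0)
    return EPS_START + fraction * (EPS_END - EPS_START)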

Additional info

  • OpenAI gym Pong-v0 environment
  • Feeding the model stacks of the last 4 observed frames, scaled and cropped to 84x84 so that only the "playing area" is visible.
  • Treating the loss of a volley (end-of-life) as a terminal state in the replay buffer.
  • Using smooth_l1_loss, which acts as a Huber loss (see the sketch after this list).
  • Clipping gradients between -1 and 1 before optimizing.
  • Offsetting the beginning of each episode with 4-30 no-op steps, as the papers suggest.
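
Below is a minimal sketch of the loss and gradient-clipping step from the list above (the function signature, the sampled batch tensors, and GAMMA = 0.99 are my own assumptions for illustration, not the author's exact code):

import torch
import torch.nn.functional as F

GAMMA = 0.99  # assumed discount factor; not stated in the question

def optimize_step(policy_net, target_net, optimizer, states, actions, rewards, next_states, dones):
    # Q(s, a) for the actions that were actually taken
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped targets from the periodically-updated target network
    with torch.no_grad():
        next_q = target_net(next_states).max(1)[0]
        targets = rewards + GAMMA * next_q * (1.0 - dones)

    # smooth_l1_loss acts as a Huber loss: quadratic near zero, linear for large errors
    loss = F.smooth_l1_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    # Clip gradients element-wise to [-1, 1] before the optimizer step
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()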

Has anyone had a similar experience of getting stuck around 6 - 9 average reward per episode like this?

Any suggestions for changes to hyperparameters or algorithmic nuances would be greatly appreciated!

Mink
  • Hi @Mink, I am working on the same project right now, but my average score is capped at -10. Some differences I can see between our implementations: (1) I use the "PongDeterministic-v4" environment, which takes care of frame skipping; (2) I did not crop the score area at the top, but my frames are 84x84; (3) I do not treat losing a volley as a terminal state; (4) my replay buffer size is 50000, since RAM fills up and overflows if I use any more than that (no idea how you managed to store 100000 experience tuples); (5) I am using an L2 loss, though I was planning to switch to a Huber loss. Do you have a GitHub repo? – hridayns Jan 26 '19 at 12:48
  • @hridayns, to prevent memory overflow, I save the images as uint8 type in my replay buffer and cast (and divide by 255) right before I forward pass it to my model. – Mink Jan 28 '19 at 01:50
  • By the way, the GitHub repo is https://github.com/MatthewInkawhich/learnRL – Mink Jan 28 '19 at 01:52
  • I have done the same uint8 conversion. I was planning to normalize by 255 too, but I believe they take up the same amount of memory whether divided or not, and I was worried I might lose information if I did that. What do you think? Also, I have followed you on GitHub. Regarding your issue, I was thinking it might have something to do with your target network update frequency: it could be too low, causing instability during learning. Another thing you could try is a prioritized replay buffer. – hridayns Jan 28 '19 at 11:42
  • @hridayns thanks for the suggestions, I will try that. Also, I would recommend always normalizing your image inputs between 0 and 1, as it tends to play nicer with the initial weights and activation function. – Mink Jan 28 '19 at 14:08
  • Thanks. I will probably implement that then. Along with better initialization schemes for the networks, gradient and reward clipping. You can check out my repo here if you like: https://github.com/hridayns/Reinforcement-Learning/ – hridayns Jan 29 '19 at 10:23
  • According to this answer [here](https://stackoverflow.com/a/38363141/3623131), using a squared loss with gradient clipping is equivalent to using a Huber loss without gradient clipping. The gradient clipping step may not be required for you, since you are already using something close to a Huber loss. – hridayns Jan 29 '19 at 14:21

1 Answer


In case you still haven't been able to resolve the problem, here's a link to the answer to my own question, which lists the step-wise changes I made to reach a +18 average score saturation, using just a 10000-size replay buffer and a plain Double DQN (DDQN) trained for about 700-800 episodes. The updated code can also be found here.

No fancy additions like a Prioritized Replay Buffer or secret hyperparameter changes are required. It's usually something simple, like a small problem in the input preprocessing step.

hridayns