
I am implementing the DQN model from scratch in PyTorch, with Atari Pong as the target environment. After some hyper-parameter tweaking, I cannot seem to get the model to achieve the performance reported in most publications (~ +21 reward, meaning the agent wins almost every volley).

My most recent results are shown in the figure below. Note that the x axis is episodes (full games to 21), but the total number of training iterations is ~6.7 million.

[Figure: average reward per episode over training, plateauing around 6-9]

The specifics of my setup are as follows:

Model

import torch.nn as nn
import torch.nn.functional as F

class DQN(nn.Module):
    def __init__(self, in_channels, outputs):
        super(DQN, self).__init__()
        # Convolutional stack from the Nature DQN architecture (Mnih et al., 2015)
        self.conv1 = nn.Conv2d(in_channels=in_channels, out_channels=32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1)
        # An 84x84 input becomes a 7x7x64 feature map after the three conv layers
        self.fc1 = nn.Linear(in_features=64*7*7, out_features=512)
        self.fc2 = nn.Linear(in_features=512, out_features=outputs)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(-1, 64 * 7 * 7)    # flatten conv features for the fully connected layers
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x    # return Q values of each action
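
For reference, a quick shape sanity check for this network (a minimal sketch; the 4-frame stack, 84x84 input size, and Pong's 6-action space come from the setup described below, and the batch of zeros is purely illustrative):

import torch

policy_net = DQN(in_channels=4, outputs=6)    # 4 stacked frames in, 6 Q-values out (Pong's action space)
dummy_batch = torch.zeros(1, 4, 84, 84)       # (batch, channels, height, width)
q_values = policy_net(dummy_batch)
print(q_values.shape)                         # torch.Size([1, 6])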

Hyperparameters

  • batch size: 32
  • replay memory size: 100000
  • initial epsilon: 1.0
  • epsilon anneals linearly to 0.02 over the first 100000 steps (see the sketch after this list)
  • random warm-start steps: ~50000
  • update target model every: 1000 steps
  • optimizer = optim.RMSprop(policy_net.parameters(), lr=0.0025, alpha=0.9, eps=1e-02, momentum=0.0)
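
A minimal sketch of the linear epsilon schedule from the list above (the helper name and the explicit step argument are my own illustration, not the author's code):

EPS_START = 1.0
EPS_END = 0.02
EPS_DECAY_STEPS = 100000

def epsilon_by_step(step):
    # Linearly anneal from EPS_START down to EPS_END over EPS_DECAY_STEPS, then hold at EPS_END
    fraction = min(step / EPS_DECAY_STEPS, 1.0)
    return EPS_START + fraction * (EPS_END - EPS_START)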

Additional info

  • OpenAI gym Pong-v0 environment
  • Feeding the model stacks of the last 4 observed frames, scaled and cropped to 84x84 so that only the "playing area" is visible.
  • Treating the loss of a volley (end-of-life) as a terminal state in the replay buffer.
  • Using smooth_l1_loss, which acts as a Huber loss (see the sketch after this list).
  • Clipping gradients between -1 and 1 before optimizing.
  • Offsetting the beginning of each episode with 4-30 no-op steps, as the papers suggest.
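
Below is a minimal sketch of the loss and gradient-clipping step from the list above (the function signature, the sampled batch tensors, and GAMMA = 0.99 are my own assumptions for illustration, not the author's exact code):

import torch
import torch.nn.functional as F

GAMMA = 0.99  # assumed discount factor; not stated in the question

def optimize_step(policy_net, target_net, optimizer, states, actions, rewards, next_states, dones):
    # Q(s, a) for the actions that were actually taken
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped targets from the periodically-updated target network
    with torch.no_grad():
        next_q = target_net(next_states).max(1)[0]
        targets = rewards + GAMMA * next_q * (1.0 - dones)

    # smooth_l1_loss acts as a Huber loss: quadratic near zero, linear for large errors
    loss = F.smooth_l1_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    # Clip gradients element-wise to [-1, 1] before the optimizer step
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()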

Has anyone had a similar experience of getting stuck around 6 - 9 average reward per episode like this?

Any suggestions for changes to hyperparameters or algorithmic nuances would be greatly appreciated!

Mink
  • Hi @Mink, I am working on the same project right now, but my average score is capped at -10. Some differences I can see between our implementations: (1) I use the "PongDeterministic-v4" environment, which takes care of frame skipping; (2) I did not crop the score area at the top, but my frames are 84x84; (3) I do not treat losing a volley as a terminal state; (4) my replay buffer size is 50000, since RAM fills up and overflows if I use any more than that (no idea how you managed to store 100000 experience tuples); (5) I am using an L2 loss, though I was planning to switch to a Huber loss. Do you have a GitHub repo? – hridayns Jan 26 '19 at 12:48
  • @hridayns, to prevent memory overflow, I save the images as uint8 type in my replay buffer and cast (and divide by 255) right before I forward pass it to my model. – Mink Jan 28 '19 at 01:50
  • By the way, the GitHub repo is https://github.com/MatthewInkawhich/learnRL – Mink Jan 28 '19 at 01:52
  • I have done the same uint8 conversion. I was planning to normalize by 255 too, but I believe they take up the same amount of memory whether divided or not, and I was worried I might lose information if I did that. What do you think? Also, I have followed you on GitHub. Regarding your issue, I was thinking it might have something to do with your target network update frequency: it could be too low, causing instability during learning. Another thing you could try is a prioritized replay buffer. – hridayns Jan 28 '19 at 11:42
  • @hridayns thanks for the suggestions, I will try that. Also, I would recommend always normalizing your image inputs between 0 and 1, as it tends to play nicer with the initial weights and activation function. – Mink Jan 28 '19 at 14:08
  • Thanks. I will probably implement that then. Along with better initialization schemes for the networks, gradient and reward clipping. You can check out my repo here if you like: https://github.com/hridayns/Reinforcement-Learning/ – hridayns Jan 29 '19 at 10:23
  • According to this answer [here](https://stackoverflow.com/a/38363141/3623131), using a squared loss with gradient clipping is equivalent to using a Huber loss without gradient clipping. The gradient clipping step may not be required for you, since you are already using something close to a Huber loss. – hridayns Jan 29 '19 at 14:21

1 Answer


In case you still haven't been able to resolve the problem, here's a link to the answer to my own question, which lists the step-wise changes I made to reach a +18 average score saturation, using just a 10000-size replay buffer and a plain Double DQN (DDQN) trained for about 700-800 episodes. The updated code can also be found here.

No fancy additions like a Prioritized Replay Buffer or secret hyperparameter changes are required. It's usually something simple, like a small problem in the input preprocessing step.

hridayns