
I'm trying to build a DQN to replicate the DeepMind results. I'm starting with a simple DQN for the moment, but it isn't learning properly: after 5000+ episodes, it can't score more than 9-10 points. Each episode has a limit of 5000 steps, but the agent never lasts more than 500-700. I think the problem is in the replay function, which is:

def replay(self, replay_batch_size, replay_batcher):
    j = 0
    k = 0
    replay_action = []
    replay_state = []
    replay_next_state = []
    replay_reward = []
    replay_superbatch = []

    if len(memory) < replay_batch_size:
        replay_batch = random.sample(memory, len(memory))
        replay_batch = np.asarray(replay_batch)
        replay_state_batch, replay_next_state_batch, reward_batch, replay_action_batch = replay_batcher(replay_batch)
    else:
        replay_batch = random.sample(memory, replay_batch_size)
        replay_batch = np.asarray(replay_batch)
        replay_state_batch, replay_next_state_batch, reward_batch, replay_action_batch = replay_batcher(replay_batch)
        
    # Group the sampled experiences into chunks of 4 (the stacked frames).
    for j in range(len(replay_batch) - len(replay_batch) % 4):
        
        if k <= 4:
            k = k + 1              
            replay_state.append(replay_state_batch[j])
            replay_next_state.append(replay_next_state_batch[j])
            replay_reward.append(reward_batch[j])
            replay_action.append(replay_action_batch[j])
            
        if k >=4:                
            k = 0
            replay_state = np.asarray(replay_state)
            replay_state.shape = shape
            replay_next_state = np.asarray(replay_next_state)
            replay_next_state.shape = shape
            replay_superbatch.append((replay_state, replay_next_state, replay_reward, replay_action))

            replay_state = []
            replay_next_state = []
            replay_reward = []
            replay_action = []
                                       
    states, target_future, targets_future, fit_batch = [], [], [], []
    
    for state_replay, next_state_replay, reward_replay, action_replay in replay_superbatch:

        target = reward_replay
        if not done:
            target = (reward_replay + self.gamma * np.amax(self.model.predict(next_state_replay)[0]))

        target_future = self.model.predict(state_replay)

        target_future[0][action_replay] = target
        states.append(state_replay[0])
        targets_future.append(target_future[0])
        fit_batch.append((states, targets_future))

    history = self.model.fit(np.asarray(states), np.array(targets_future), epochs=1, verbose=0)

    loss = history.history['loss'][0]

    if self.exploration_rate > self.exploration_rate_min:
        self.exploration_rate -= (self.exploration_rate_decay / 1000000)
    return loss

What I'm doing is taking 4 experiences (states), concatenating them, and feeding them into the CNN with shape (1, 210, 160, 4). Am I doing something wrong? If I implement DDQN (Double Deep Q-Network), should I obtain results similar to the DeepMind Breakout video? Also, I'm using the Breakout-v0 environment from OpenAI Gym.
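
For reference, the stacking I mean looks roughly like this (a minimal sketch; stack_frames and frame_buffer are illustrative names, not my actual preprocessing):

import numpy as np

def stack_frames(frame_buffer):
    # frame_buffer: the 4 most recent frames, each of shape (210, 160)
    # after dropping the colour channel.
    stacked = np.stack(frame_buffer, axis=-1)  # -> (210, 160, 4)
    return stacked[np.newaxis, ...]            # -> (1, 210, 160, 4), ready for the CNN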

Edit

Am I doing this properly? I implemented an identical CNN as the target network; every 100 steps I copy the weights from the model CNN to the target_model CNN. Should this improve the learning? Either way, I'm getting a low loss. (The weight copy itself, agent.update_net(), is sketched after the code below.)

for state_replay, next_state_replay, reward_replay, action_replay in replay_superbatch:

    target = reward_replay
    if not done:
        target = (reward_replay + self.gamma * np.amax(self.model.predict(next_state_replay)[0]))
    if steps % 100 == 0:
        target_future = self.target_model.predict(state_replay)
        target_future[0][action_replay] = target
        states.append(state_replay[0])
        targets_future.append(target_future[0])
        fit_batch.append((states, targets_future))
        agent.update_net()

history = self.model.fit(np.asarray(states), np.array(targets_future), epochs=1, verbose=0)

loss = history.history['loss'][0]
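
For context, agent.update_net() is only meant to copy the weights across, roughly like this (a minimal sketch assuming Keras models):

def update_net(self):
    # Overwrite the frozen target network's weights with the online network's current weights.
    self.target_model.set_weights(self.model.get_weights())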

Edit 2

So, as far as I understand, this code should work, am I right?

if not done:
    target = (reward_replay + self.gamma * np.amax(self.target_model.predict(next_state_replay)[0]))
    target.shape = (1, 4)

    target[0][action_replay] = target
    target_future = target
    states.append(state_replay[0])
    targets_future.append(target_future[0])
    fit_batch.append((states, targets_future))

if step_counter % 1000 == 0:
    target_future = self.target_model.predict(state_replay)

    target_future[0][action_replay] = target
    states.append(state_replay[0])
    targets_future.append(target_future[0])
    fit_batch.append((states, targets_future))
    agent.update_net()

history = self.model.fit(np.asarray(states), np.array(targets_future), epochs=1, verbose=0)
  • DQN needs a lot of episodes. Additionally, the training is super unstable, meaning that out of 100 training runs it could happen that 95 are completely unusable and 5 are fine. – Martin Thoma Nov 26 '18 at 06:46
  • So, how can I improve the training stability? Should Double DQN improve it? – JCP Nov 27 '18 at 04:32

1 Answer


It looks like on each step, you're calling both `self.model.predict` and `self.model.fit`. If you do this, you're going to run into stability problems, since your learning target is moving as you train.

The way the DQN paper gets around this problem is by using 2 Q-networks, $Q$ and $\hat{Q}$, where $\hat{Q}$ is called the target network. The target network's parameters are frozen, and its outputs are used to compute the learning targets for $Q$ (targets_future in your code). Every $C$ training steps (where $C$ is a hyperparameter), the target network $\hat{Q}$ is updated with the weights of $Q$. See Algorithm 1 on Page 7 of the DQN paper for the details of this swap.
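
In Keras-style pseudocode, the pattern looks roughly like this (a sketch with illustrative variable names, not a drop-in replacement for your replay function):

# Compute targets with the frozen target network; train only the online network.
q_next = self.target_model.predict(next_state_batch)      # target-network values for s'
targets = self.model.predict(state_batch)                  # start from the current Q(s, .)
targets[np.arange(len(action_batch)), action_batch] = (
    reward_batch + self.gamma * np.amax(q_next, axis=1) * (1 - done_batch))

self.model.fit(state_batch, targets, epochs=1, verbose=0)  # gradient step on Q only

# Every C training steps, copy the online weights into the target network.
if step_counter % C == 0:
    self.target_model.set_weights(self.model.get_weights())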

  • Could you check the edit I made to the original post? I'd appreciate it so much! – JCP Nov 30 '18 at 21:04
  • When you're doing the Bellman update, you want to use `self.target_model.predict` rather than `self.model.predict`. Also, 100 steps seems low. You'll want to play around with that parameter but something >1000 is probably better. – Nishant Desai Nov 30 '18 at 21:29
  • So basically I only use the `model` CNN for action selection and the loss update, am I right? – JCP Nov 30 '18 at 21:32
  • That's right. `self.model` is the model that you call `.fit` on, and it's also the model you use to select your actions. There's a more detailed explanation of the difference between the two models on this question: https://ai.stackexchange.com/questions/6982/why-does-dqn-require-two-different-networks – Nishant Desai Nov 30 '18 at 21:58
  • I still have a doubt: should I update only the `target_model` weights every C training steps, or am I doing it right in the last edit I did? Thanks. – JCP Dec 01 '18 at 02:36
  • @JCP I... think your **Edit 2** looks mostly fine. After the `if step_counter % 1000 == 0:`, you have a bunch of code duplicated that's also already running before that condition... this means you're running those duplicated parts twice in a row in the `step_count % 1000 == 0` case? That wouldn't be necessary. Just running all that code once, regardless of whether your step count is a multiple of `1000`, should be fine. Under the condition, you'll only need the new `agent.update_net()` call, which I assume is implemented to copy the weights of `self.model` into `self.target_model`? – Dennis Soemers Dec 01 '18 at 09:42