
I have the following code (below), where an agent uses Q-learning (RL) to play a simple game.

What seems questionable to me in this code is the fixed learning rate. When it is set low (as it is in this example), the update always favours the old Q-value over the newly learnt value, and vice versa when it is set high.

My thinking was: shouldn't the learning rate be dynamic? It should start high, because at the beginning the Q-table contains nothing but zeros and the agent is simply recording the best actions it encounters, so we should be favouring the new Q-values over the existing ones. Then, over time (say, every n episodes), we decrease the learning rate to reflect that the values in the Q-table are becoming more and more accurate (through repeated updates with the Bellman equation), so that lowering the learning rate starts to favour the existing values in the Q-table over the new ones. I'm not sure whether my logic has gaps or flaws, but I'm putting it out there to get feedback from more experienced people.

To make things easier, the line to refer to in the code below (the one that updates the Q-value using the learning rate) is under the comment: # Update Q-table for Q(s,a) with learning rate

import numpy as np
import gym
import random
import time
from IPython.display import clear_output

env = gym.make("FrozenLake-v0")

action_space_size = env.action_space.n
state_space_size = env.observation_space.n

q_table = np.zeros((state_space_size, action_space_size))

num_episodes = 10000
max_steps_per_episode = 100

learning_rate = 0.1
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001


rewards_all_episodes = []

for episode in range(num_episodes):
    # initialize new episode params
    state = env.reset()
    
    done = False 
    rewards_current_episode = 0 
    
    for step in range(max_steps_per_episode):
        
        # Exploration-exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state,:])
        else:
            action = env.action_space.sample()
            
        new_state, reward, done, info = env.step(action)
        
        # Update Q-table for Q(s,a) with learning rate
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
            learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))
        
        state = new_state
        rewards_current_episode += reward
        
        if done:
            break
        
        
    # Exploration rate decay
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)
    
    rewards_all_episodes.append(rewards_current_episode)

# Calculate and print the average rewards per thousand episodes
rewards_per_thousand_episodes = np.array_split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000
print("******* Average reward per thousand episodes ************")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000
    
# Print updated Q-table
print("\n\n********* Q-table *************\n")
print(q_table)
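
For reference, the kind of schedule described above could be sketched as follows. This is only an illustrative variant of the script, mirroring the exploration-rate decay it already contains; max_learning_rate, min_learning_rate and learning_rate_decay are arbitrary example constants, not tuned values.

# Illustrative only: decay the learning rate per episode, analogous to the
# exploration-rate decay in the script above. Constants are arbitrary examples.
max_learning_rate = 1.0
min_learning_rate = 0.01
learning_rate_decay = 0.001

learning_rate = max_learning_rate
for episode in range(num_episodes):
    # ... run the episode exactly as before, using `learning_rate`
    # in the Q-update line ...

    # Decay the learning rate at the end of each episode
    learning_rate = min_learning_rate + \
        (max_learning_rate - min_learning_rate) * np.exp(-learning_rate_decay * episode)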
Hazzaldo
  • I'm not aware of any substantial research on the learning rate in DQN. For tabular Q-learning it's fairly obvious that the learning rate should decrease gradually, after each plateau of a stable solution. – mirror2image May 12 '19 at 06:05

1 Answer


Yes, you can decay the learning rate in Q-learning, and yes, this should result in more accurate Q-values in the long term for many environments.

However, this is something that is harder to manage than in supervised learning, and might not be as useful or necessary. The issues you need to be concerned about are:

Non-stationarity

The target values in value-based optimal control methods are non-stationary. The TD target derived from the Bellman equation is never quite a sample from the optimal action value until the very final stages of learning. This is an issue both due to iterative policy improvements and the bootstrap nature of TD learning.
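
To make this concrete, the update line in the question's code is the standard tabular Q-learning rule, which can be written as

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ \underbrace{r + \gamma \max_{a'} Q(s',a')}_{\text{TD target}} - Q(s,a) \right]$$

where $\alpha$ is the learning rate and $\gamma$ the discount rate. The TD target depends on the current estimate $\max_{a'} Q(s',a')$, which keeps changing as learning progresses, so the quantity the update is chasing moves throughout training.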

Reducing learning rate before you reach optimal control could delay finding the optimal policy. In general you want the learning rate to be just low enough that inaccuracies due to over/undershooting the correct value don't prevent or delay differentiating between actions for whatever the interim policy is.

Policy indifference to accurate action values

Q-learning's policy is based on $\pi(s) = \text{argmax}_{a} Q(s,a)$. To obtain an optimal policy, what you care about is identifying the highest-valued action, $a$. As such, it is possible in some environments to achieve optimal control whilst Q values are far from accurate, provided the highest-valued action still has the highest estimate of expected return.
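
As a toy illustration (the numbers are made up, not taken from FrozenLake), a heavily biased estimate can still yield the same greedy action as the true values:

import numpy as np

# Hypothetical action values for one state: a "true" set and a rough estimate.
true_q      = np.array([0.50, 0.90, 0.30, 0.10])
estimated_q = np.array([0.20, 0.55, 0.15, 0.05])   # far from accurate...

# ...but the argmax is the same, so the greedy policy pi(s) = argmax_a Q(s,a)
# is already optimal for this state.
print(np.argmax(true_q), np.argmax(estimated_q))   # both print 1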

Whether or not you need highly accurate Q-values to determine a policy depends on the environment, and in particular on how close the values of the available actions are to each other. For long-horizon problems (i.e. $\gamma \approx 1$), where individual time steps make little difference, this matters more, as there will be only minor differences between action values.

If your goal is to get highly accurate predictions of action values, then the above does not necessarily apply.

Some deterministic environments can use a high learning rate

This only applies to some simpler environments, but you should definitely bear it in mind when working with tabular solvers.

For instance, it is possible to apply tabular Q-learning to Tic Tac Toe with a learning rate of $1.0$ - essentially replacing each estimate with the latest one - and it works just fine.
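
In terms of the update line from the question, setting learning_rate = 1.0 makes the weighted average disappear, and the old estimate is simply overwritten by the latest TD target:

# Equivalent to the question's update line when learning_rate == 1.0
q_table[state, action] = reward + discount_rate * np.max(q_table[new_state, :])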

In other, more complex environments, this would be a problem and the algorithm would not converge. But clearly, adding a learning rate decay is not a general requirement.

The learning rate decay schedule is a hyperparameter

There is no generic schedule that applies to all environments and is equally effective in all of them. For an optimal approach, you would need to run a search over possible decay schedules, and the most efficient schedule you find would strictly apply only to the environment you tested it on. It would likely transfer well enough to similar environments that knowing it could still be an advantage in future.
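
For illustration only, two commonly tried schedule shapes might look like this; the constants are arbitrary placeholders, and a search over schedules of this kind is what the paragraph above refers to.

import numpy as np

# Hypothetical decay schedules; the constants are arbitrary examples.
def exponential_decay(episode, lr_start=0.5, lr_min=0.01, decay=0.001):
    return lr_min + (lr_start - lr_min) * np.exp(-decay * episode)

def inverse_time_decay(episode, lr_start=0.5, k=0.01):
    return lr_start / (1.0 + k * episode)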

In complex environments, there are other working solutions

In practice, for environments complex enough that neural networks are used as function approximators, the usual approach is Deep Q-Networks (DQN). It is common to use a conservative/low learning rate with DQN, due to the stability issues that arise when combining neural networks with off-policy reinforcement learning.

In addition, you will quite often see an adaptive gradient optimiser - e.g. Adam or RMSProp - used with DQN. These already adjust gradient step sizes based on recent gradients observed in the neural network during training, so there is less need to add a dynamic learning rate schedule on top.
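
As a minimal sketch (assuming a small PyTorch network; nothing here is part of the question's code, and the sizes and learning rate are arbitrary examples), such a setup might look like:

import torch
import torch.nn as nn

# Hypothetical DQN-style network and optimiser: a conservative base learning
# rate combined with an adaptive optimiser (Adam adjusts per-parameter steps).
q_network = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-4)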

Supervised learning challenges, such as ImageNet, do sometimes add a learning rate schedule on top of adaptive optimisers, resulting in further improvements. So it might help with DQN too, but the other caveats above could still apply.

Neil Slater
  • Really sorry for the very late reply. I reviewed your answer initially and came across terms I wasn't sure of, so I went away to read up more on RL and watch more video lectures, as I realised I had gaps in my knowledge. After coming back to read your answer again, it all makes perfect sense. Thank you so much for the detailed, comprehensive answer, really appreciated, and apologies again for the very delayed reply. – Hazzaldo Jun 03 '19 at 23:57