
I am trying to train a Q-learning agent on the following game: the states are parametrised by an integer $S \geq 0$ (the sum of the previous die rolls). In each step the player can choose to roll a die or quit the game. Whenever the player rolls a six, the game terminates with 0 reward. Otherwise, the player updates $S \to S + \text{(number rolled)}$. If the player chooses to quit, they receive $S$ as a reward. The most natural strategy is to choose a threshold $t$ and roll the die whenever $S \leq t$. I believe the optimal $t$ (i.e. the one maximising expected value) is $1+2+3+4+5 = 15$. In simulations, the $t=15$ strategy achieves an expected value of around 6. A standard Q-learning agent achieves a reward of 2.1, and a double Q-learning agent (as in the original van Hasselt paper) around 2.5. Is such low performance to be expected? Is there a variant of Q-learning that would perform much better? Or did I probably make a mistake in the implementation or choice of hyperparameters (learning rate $= 0.6$, training episodes $= 10000$, $\gamma = 0.95$)?

deepfloe

1 Answer


In short, you do have a problem with your hyperparameters, but single-step Q-learning also appears to struggle a lot with this simple-looking environment. The environment seems simple because it can be solved quickly analytically. However, Q-learning does not and cannot analyse the environment; it must learn through trial and error.

This is the optimal action value table for states 0 through 20. It can be calculated quickly (in a fraction of a second using Python) and to high accuracy using value iteration.

State    Roll     Quit  Optimal Choice
    0   6.154    0.000   r
    1   6.532    1.000   r
    2   6.933    2.000   r
    3   7.359    3.000   r
    4   7.810    4.000   r
    5   8.289    5.000   r
    6   8.798    6.000   r
    7   9.340    7.000   r
    8   9.914    8.000   r
    9  10.522    9.000   r
   10  11.161   10.000   r
   11  11.853   11.000   r
   12  12.588   12.000   r
   13  13.361   13.000   r
   14  14.167   14.000   r
   15  15.000   15.000   r,q
   16  15.833   16.000   q
   17  16.667   17.000   q
   18  17.500   18.000   q
   19  18.333   19.000   q
   20  19.167   20.000   q

Any policy with an expected score around $6.1$ from $S=0$ should be close to optimal.
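For reference, here is a minimal value-iteration sketch that reproduces the table above. This is my reconstruction, not a definitive implementation; it caps the state at 100, which is safe because the optimal policy quits from $S=16$ onwards.

```python
# Value iteration for the dice game: "roll" ends the episode with reward 0
# on a six, otherwise moves S -> S + (1..5); "quit" ends it with reward S.
# Gamma = 1, so no discount factor appears in the backup.
MAX_S = 100  # treat states above this as "quit immediately"

# V[s] holds the current state-value estimate; states past the cap keep their quit value
V = [0.0] * (MAX_S + 6)
for s in range(MAX_S, MAX_S + 6):
    V[s] = float(s)

def q_values(s):
    q_quit = float(s)                                  # quitting pays S and terminates
    q_roll = sum(V[s + k] for k in range(1, 6)) / 6.0  # a six (prob 1/6) contributes value 0
    return q_roll, q_quit

# Transitions only move to higher states, so sweeping from high S down to 0
# converges almost immediately; a handful of sweeps is more than enough.
for _ in range(5):
    for s in range(MAX_S - 1, -1, -1):
        V[s] = max(q_values(s))

print("State    Roll     Quit  Optimal Choice")
for s in range(21):
    q_roll, q_quit = q_values(s)
    if abs(q_roll - q_quit) < 1e-9:
        choice = "r,q"
    else:
        choice = "r" if q_roll > q_quit else "q"
    print(f"{s:5d} {q_roll:8.3f} {q_quit:8.3f}   {choice}")
```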

So, what is wrong with your Q-Learning attempt?

The main problem impacting Q-learning is the high variance of the return from the $roll$ action, especially at higher state numbers. This requires a low learning rate, otherwise the "jitter" from the most recent experiences is strong enough that the correct policy never settles. Imagine that you had a close-to-correct action value for $S=14$, say $\hat{q}(14,roll) = 14.2$, and a learning rate of $\alpha = 0.1$. What would that update to if you chose $roll$ in state $14$ and the die came up a six? It would change to $\hat{q}(14,roll) \approx 12.8$, changing for a while what the greedy action choice is.
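Written out, that single terminal update (reward $0$ and no bootstrap term, because rolling a six ends the episode) is

$$\hat{q}(14,roll) \leftarrow \hat{q}(14,roll) + \alpha\bigl(0 - \hat{q}(14,roll)\bigr) = 14.2 + 0.1 \times (0 - 14.2) = 12.78.$$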

If you make the learning rate low to combat this, you will need more iterations to converge. A lot more.

The high variance is also a problem when combined with the relatively close optimal values of the $roll$ and $quit$ actions around the break-even point. Choosing to roll at $S=14$ has only around 1% higher expected return than choosing to quit, and similarly for choosing to quit instead of roll at $S=16$. This should not prevent learning agents getting close to the best scores (after all, it is only a 1% difference in a later expected value, even if the agent gets it wrong), but you will often see incorrect action choices learned in the $S=12$ to $S=18$ region for that reason.

In addition, a minor point: the discount factor $\gamma$ is not a learning hyperparameter. It is part of the problem definition. This changes slightly when using function approximation, where it is common to pick a value like 0.99 to help stabilise long-term predictions. In your case I think the correct value is $\gamma = 1$.

Using single-step Q-learning, without any adjustments, I found the following hyperparameters performed well:

  • Number of episodes $N = 1000000$
  • Learning rate $\alpha = 0.01$
  • Epsilon randomness $\epsilon = 0.25$

This took around 20 seconds to run, and the result was not perfect (typically it will quit on 13+, but is otherwise consistent). That extra time is the cost of not knowing the model or exploiting any of its structure: the agent has to discover the reliability of quitting versus the variance of rolling separately at each state (because there is no generalisation from a function approximator). Increasing to $N = 10000000$ and decreasing to $\alpha = 0.005$ takes over 3 minutes, but gets close to perfect values.
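For concreteness, here is a minimal tabular Q-learning sketch along these lines. It is my own reconstruction rather than the exact code behind the numbers above; the environment follows the question's description, and it is pure Python, so it may take a while to run.

```python
import random
from collections import defaultdict

# Hyperparameters as listed above
N_EPISODES = 1_000_000
ALPHA = 0.01
EPSILON = 0.25
GAMMA = 1.0           # part of the problem definition, as discussed above

ROLL, QUIT = 0, 1
Q = defaultdict(lambda: [0.0, 0.0])   # Q[s] = [q(s, roll), q(s, quit)]

for _ in range(N_EPISODES):
    s = 0
    while True:
        # epsilon-greedy behaviour policy
        if random.random() < EPSILON:
            a = random.choice([ROLL, QUIT])
        else:
            a = ROLL if Q[s][ROLL] > Q[s][QUIT] else QUIT

        if a == QUIT:
            # terminal transition: target is just the reward S
            Q[s][QUIT] += ALPHA * (s - Q[s][QUIT])
            break

        die = random.randint(1, 6)
        if die == 6:
            # terminal transition: rolling a six gives reward 0
            Q[s][ROLL] += ALPHA * (0.0 - Q[s][ROLL])
            break

        # non-terminal roll: reward 0, bootstrap from the greedy value of the next state
        s_next = s + die
        Q[s][ROLL] += ALPHA * (GAMMA * max(Q[s_next]) - Q[s][ROLL])
        s = s_next

# greedy policy after training
for s in range(21):
    print(s, round(Q[s][ROLL], 3), round(Q[s][QUIT], 3),
          "r" if Q[s][ROLL] > Q[s][QUIT] else "q")
```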

I'm not sure which variants of Q-learning could help. Double Q-learning is not going to help much, because that deals with maximisation bias, which is not really a problem here. Potentially $n$-step Q-learning could help by making multiple updates at once, which should offset the low learning rate.

A quick fix within single-step Q-learning is to use a learning rate schedule. If you arrange to exponentially decay the learning rate from $1.0$ to $0.005$ over the course of all episodes, then you can reduce the episode count to 100,000 and still get pretty good results. This is because the high learning rates early on quickly converge on the deterministic returns from choosing the $quit$ action, and the lower learning rates later allow the action values for the $roll$ action to stabilise.
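One possible implementation of such a schedule (the exact decay curve is a choice, this is just per-episode exponential interpolation, reusing names from the sketch above):

```python
# Exponential decay from ALPHA_START to ALPHA_END over N_EPISODES episodes
ALPHA_START, ALPHA_END = 1.0, 0.005
N_EPISODES = 100_000

def alpha_for(episode):
    # interpolate so that episode 0 uses ALPHA_START and the final episode uses ALPHA_END
    frac = episode / (N_EPISODES - 1)
    return ALPHA_START * (ALPHA_END / ALPHA_START) ** frac
```

Inside the training loop, the fixed ALPHA would then be replaced by alpha_for(episode).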

Neil Slater
  • Thank you, your answer is very instructive! I have been able to replicate your results :) I hadn't appreciated the point that the discount factor is part of the problem definition. I suppose setting it to one is always appropriate in "mathematical" games? I.e. where we control/know all sources of randomness? Why do you choose a constant epsilon instead of a decaying one? – deepfloe Jun 22 '23 at 06:36
  • I chose a constant and relatively high epsilon to try to ensure that the high-variance roll action was always being assessed and its values refined. It probably is not necessary, and you could try something like starting at 1 and decaying down to e.g. 0.1. That would probably mean the Q-table values for roll actions above state 14 will be inaccurate, but the policy should still be correct (where it isn't, the roll action would get chosen again and its value refined) – Neil Slater Jun 22 '23 at 06:44
  • makes sense, thank you! – deepfloe Jun 22 '23 at 08:10