
I'm new to reinforcement learning.

As is common in RL, $\epsilon$-greedy action selection is used for the behavior/exploration policy. At the beginning of training, $\epsilon$ is high, so a lot of random actions are chosen. Over time, $\epsilon$ decreases and we choose the (currently) best action more often.
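
For concreteness, here is a minimal sketch of the kind of decay schedule I mean (the names `eps_min` and `eps_decay` are just my own, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, eps):
    """With probability eps take a random action, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

eps, eps_min, eps_decay = 1.0, 0.01, 0.995  # start exploratory, decay per episode
for episode in range(1000):
    # ... run one episode, selecting actions with epsilon_greedy(Q[state], eps) ...
    eps = max(eps_min, eps * eps_decay)
```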

  1. I was wondering, e.g. in Q-learning, whether the Q-values really still change much when $\epsilon$ is small, e.g. 0.1 or 0.01. Do they just keep moving in the same direction, i.e. the best action remains the best action and the Q-values merely drift further apart, or do the values change enough that the best action for a given state can still change?

  2. If the Q-values really do still change significantly, is that because of the remaining random actions that we still take when $\epsilon > 0$, or would they still change at $\epsilon = 0$?

  • The question in the title is different from the questions in the body, although they are related. I suggest that you connect the title's question with the body's questions in the body as well, so that we understand what your main concern/question is. For example, the only current answer does not address the question in the body, and that's probably because there's no connection in the body. – nbro Oct 02 '20 at 10:37
  • 1
    @nbro yes, exactly why I didn't answer the title of the question. If OP confirms whether his title is his actual question then I can answer that too. – David Oct 02 '20 at 11:09

1 Answer

  1. How much the $Q$-values change does not depend on the value of $\epsilon$; rather, $\epsilon$ dictates how likely you are to take a random action, and thus an action that could give rise to a large TD error, i.e. a large difference between the return you expected from taking this action and what you actually observed. How much a $Q$-value changes depends on the magnitude of this TD error (see the update rule written out after this list).

  2. $Q$-learning is not guaranteed to converge if there is no exploration. Part of the convergence criteria is that each state-action pair is visited infinitely often in the limit of infinitely many episodes, and if there is no exploration this will not happen; the small demo at the end of this answer illustrates how a purely greedy agent can stop visiting a state-action pair altogether.
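
To make point 1 concrete, the standard tabular $Q$-learning update is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \underbrace{\left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]}_{\text{TD error } \delta_t},$$

so each update has size $\alpha |\delta_t|$, which does not involve $\epsilon$ at all; $\epsilon$ only influences *which* state-action pairs get visited and hence updated.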

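To illustrate point 2, here is a minimal toy demo (my own example, a single-state two-armed bandit, so the TD target reduces to the immediate reward): with $\epsilon = 0$ the agent locks onto the first arm it tries, and the other arm's $Q$-value never changes.

```python
import numpy as np

rng = np.random.default_rng(0)

def run(eps, steps=5000, alpha=0.1):
    # One state, two actions: arm 0 pays 0.3 on average, arm 1 pays 1.0.
    true_means = np.array([0.3, 1.0])
    Q = np.zeros(2)
    for _ in range(steps):
        a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q))
        r = true_means[a] + rng.normal(scale=0.1)
        Q[a] += alpha * (r - Q[a])  # single state: the TD target is just r
    return Q

print("eps=0.0:", run(0.0))  # greedy sticks with arm 0; Q[1] stays at 0 forever
print("eps=0.1:", run(0.1))  # occasional random actions discover arm 1 is better
```
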
David