
I'm new to reinforcement learning.

As is common in RL, $\epsilon$-greedy action selection is used for the behavior/exploration policy. At the beginning of training, $\epsilon$ is high, so a lot of random actions are chosen. Over time, $\epsilon$ decreases and we choose the (currently) best action more often.
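
For concreteness, here is a minimal sketch of the kind of decay schedule I mean (the names `eps_min` and `eps_decay` are just my own, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, eps):
    """With probability eps take a random action, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

eps, eps_min, eps_decay = 1.0, 0.01, 0.995  # start exploratory, decay per episode
for episode in range(1000):
    # ... run one episode, selecting actions with epsilon_greedy(Q[state], eps) ...
    eps = max(eps_min, eps * eps_decay)
```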

  1. I was wondering, e.g. in Q-learning, whether the Q-values really still change much when $\epsilon$ is small, e.g. 0.1 or 0.01. Do they just keep moving in the same direction, i.e. the best action remains the best action and the Q-values merely drift further apart, or do the values change enough that the best action for a given state can still change?

  2. If the Q-values really do still change significantly, is that because of the remaining random actions that we still take when $\epsilon > 0$, or would they still change at $\epsilon = 0$?

  • The question in the title is different from the questions in the body, although they are related. I suggest that you connect the title's question with the body's questions in the body as well, so that we understand what your main concern/question is. For example, the only current answer does not address the question in the body, and that's probably because there's no connection in the body. – nbro Oct 02 '20 at 10:37
  • 1
    @nbro yes, exactly why I didn't answer the title of the question. If OP confirms whether his title is his actual question then I can answer that too. – David Oct 02 '20 at 11:09

1 Answer

  1. How much the $Q$-values change does not depend on the value of $\epsilon$; rather, $\epsilon$ dictates how likely you are to take a random action, and thus an action that could give rise to a large TD error, i.e. a large difference between the return you expected from taking this action and what you actually observed. How much a $Q$-value changes depends on the magnitude of this TD error (see the update rule written out after this list).

  2. $Q$-learning is not guaranteed to converge if there is no exploration. Part of the convergence criteria is that each state-action pair is visited infinitely often in the limit of infinitely many episodes, and if there is no exploration this will not happen; the small demo at the end of this answer illustrates how a purely greedy agent can stop visiting a state-action pair altogether.
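
To make point 1 concrete, the standard tabular $Q$-learning update is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \underbrace{\left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]}_{\text{TD error } \delta_t},$$

so each update has size $\alpha |\delta_t|$, which does not involve $\epsilon$ at all; $\epsilon$ only influences *which* state-action pairs get visited and hence updated.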

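To illustrate point 2, here is a minimal toy demo (my own example, a single-state two-armed bandit, so the TD target reduces to the immediate reward): with $\epsilon = 0$ the agent locks onto the first arm it tries, and the other arm's $Q$-value never changes.

```python
import numpy as np

rng = np.random.default_rng(0)

def run(eps, steps=5000, alpha=0.1):
    # One state, two actions: arm 0 pays 0.3 on average, arm 1 pays 1.0.
    true_means = np.array([0.3, 1.0])
    Q = np.zeros(2)
    for _ in range(steps):
        a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q))
        r = true_means[a] + rng.normal(scale=0.1)
        Q[a] += alpha * (r - Q[a])  # single state: the TD target is just r
    return Q

print("eps=0.0:", run(0.0))  # greedy sticks with arm 0; Q[1] stays at 0 forever
print("eps=0.1:", run(0.1))  # occasional random actions discover arm 1 is better
```
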
David