
For policy evaluation purposes, can we use the Q-learning algorithm even though, technically, it is meant for control?

Maybe like this (a rough code sketch follows the list):

  1. Have the policy to be evaluated as the behaviour policy.
  2. Update the Q value conventionally (i.e. update $Q(s,a)$ using the action $a'$ that gives the highest $Q(s',a')$ value).
  3. The final $Q(s,a)$ values will reflect the values for the policy being evaluated.
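
Concretely, I mean something like the following tabular sketch (the minimal `env` interface and the names here are just for illustration, not from any particular library):

```python
import numpy as np

# Assumed minimal interface (illustrative only):
#   policy(s)   -> action sampled from the policy I want to evaluate
#   env.reset() -> initial state (an integer index)
#   env.step(a) -> (next_state, reward, done)
def evaluate_with_q_learning(env, policy, n_states, n_actions,
                             episodes=1000, alpha=0.1, gamma=0.99):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                      # behaviour policy = policy to evaluate
            s_next, r, done = env.step(a)
            # "conventional" Q-learning update: bootstrap from the max over a'
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q  # hoped to be the value of `policy` -- is that right?
```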

Am I missing something here, given that I have not seen Q-learning being used anywhere for evaluation purposes?

nbro

1 Answer


For off-policy learning you must have two policies: a behaviour policy and a target policy. If the two policies are the same, then you end up with SARSA, not Q learning.

You cannot use Q learning directly to evaluate a fixed target policy, because its target policy is effectively the greedy policy: it learns the optimal value function regardless of the behaviour policy. Instead, you must use another variant of off-policy learning that can evaluate an arbitrary target policy.

Your suggested algorithm is:

  1. Have the policy to be evaluated as the behaviour policy.
  2. Update the Q value conventionally (i.e. update $Q(s,a)$ using the action $a'$ that gives the highest $Q(s',a')$ value).
  3. The final $Q(s,a)$ values will reflect the values for the policy being evaluated.

This will not work for evaluating the behaviour policy. If the behaviour policy is stochastic and covers all possible state/action choices, then it is still Q learning and will converge on the optimal value function, perhaps very slowly if the behaviour policy rarely reaches important states.

The "trick" to off-policy is that the environment interaction part uses the behaviour policy to collect data, and the update step uses the target policy to calculate estimated returns. In general for off-policy updates, there can be corrections required to re-weight the estimated returns. However, one nice thing about single-step TD methods is that there are no such additional corrections needed.

So this gives a way to do off-policy TD learning, using an approach called Expected SARSA. To use Expected SARSA, you need to know the target policy's distribution over actions, i.e. $\pi(a|s)$.

This is the variant of your description that will work to evaluate your target policy $\pi(a|s)$ (a short code sketch follows the list):

  1. Use any stochastic policy that "covers" the target policy (i.e. gives non-zero probability to every action the target policy might take) as the behaviour policy.
  2. Update the Q value using the Expected SARSA update: $Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \sum_{a'} \pi(a'|s')\,Q(s',a') - Q(s,a) \right)$
  3. The final $Q(s,a)$ values will reflect the values for the policy being evaluated.
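
A minimal tabular sketch of that procedure (again, the `env` interface and the `behaviour_policy`/`target_probs` names are illustrative assumptions, not from any specific library):

```python
import numpy as np

def expected_sarsa_evaluation(env, behaviour_policy, target_probs,
                              n_states, n_actions,
                              episodes=1000, alpha=0.1, gamma=0.99):
    """Off-policy evaluation of a fixed target policy with Expected SARSA.

    behaviour_policy(s) -> sampled action (any policy that covers the target policy)
    target_probs(s)     -> array of pi(a|s) for the target policy, shape (n_actions,)
    env.reset() -> s, env.step(a) -> (s_next, r, done)   (assumed minimal interface)
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = behaviour_policy(s)            # act with the behaviour policy
            s_next, r, done = env.step(a)
            # bootstrap with the expectation over a' under the *target* policy
            expected_q = np.dot(target_probs(s_next), Q[s_next])
            target = r + gamma * expected_q * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q  # estimate of Q^pi for the target policy
```

The only change from the Q learning scheme in the question is the bootstrap term: an expectation under $\pi$ instead of a max.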

Worth noting that Expected SARSA with a target policy of $\pi(s) = \text{argmax}_a Q(s,a)$ is exactly Q learning. Expected SARSA is a strict generalisation of Q learning that allows for learning the value function of any target policy. You may not see it used as much as Q learning, because the goal of learning an optimal value function is more common in practice.
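
To see the equivalence: if $\pi$ puts all of its probability on $\text{argmax}_{a'} Q(s',a')$, then

$$\sum_{a'} \pi(a'|s')\,Q(s',a') = \max_{a'} Q(s',a'),$$

so the Expected SARSA target reduces to the Q learning target.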

Neil Slater
    Although I don't know the details yet, I think you may also want to take a look at [**off-policy (policy) evaluation**](https://papers.nips.cc/paper/2019/file/4ffb0d2ba92f664c2281970110a2e071-Paper.pdf) (if you are not already aware of these concepts). Can you explain a little why expected SARSA would make this an off-policy evaluation method? – nbro Nov 15 '20 at 23:24
  • @Neil, Thanks for the detailed answer. I'm unclear why ESARSA would work though. I feel that it would only work if the target policy being learnt is the same as the behaviour policy. i.e. we have an on-policy setting. – Dhruv Mullick Nov 16 '20 at 07:06
    @DhruvMullick: On-policy vs off-policy is a separate issue from evaluation vs control. The key for off-policy evaluation is that you set the target policy to be the policy that you want to evaluate. Nothing more is required. Can you explain more why you feel evaluation needs to be on-policy? Then I may be able to address that. – Neil Slater Nov 16 '20 at 07:46
  • @NeilSlater, thank you, I understand now. I was earlier under the wrong impression that for Policy evaluation, I would have to set the behaviour policy as the policy to be evaluated. – Dhruv Mullick Nov 16 '20 at 17:45