
For SARSA, I know we can estimate the action value $Q(s,a)$, and the relationship between $V(s)$ and $Q(s,a)$ is $V_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a|s)Q_{\pi} (s,a)$.

So my question is: can we estimate $V_{\pi}$ simply by applying the above equation to the $Q_{\pi}$ estimate obtained from SARSA? Are there any restrictions that would prevent estimating $V_{\pi}$ through SARSA?

nbro
Dingzhi Hu

1 Answer


What you suggest will work. The main restriction is that you need to know $\pi$ fully, i.e. the probability $\pi(a|s)$ of every action in every state, in order to perform the conversion.
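As a rough illustration of that conversion, here is a minimal Python sketch. The function name `state_values_from_q`, the nested-dict layout, and the numbers are all made up for the example; it assumes a small tabular problem where both the $Q$ estimates and $\pi(a|s)$ are available as tables.

```python
def state_values_from_q(q, policy):
    """Compute V(s) = sum_a pi(a|s) * Q(s, a) for every state s.

    q[s][a]      -- estimated action value (e.g. learned by SARSA)
    policy[s][a] -- probability pi(a|s); must be known for every action
    """
    return {
        s: sum(policy[s][a] * q[s][a] for a in q[s])
        for s in q
    }


# Hypothetical Q estimates and the policy they were learned under.
q = {
    "s0": {"left": 1.0, "right": 3.0},
    "s1": {"left": 0.0, "right": 2.0},
}
policy = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.25, "right": 0.75},
}

v = state_values_from_q(q, policy)
print(v)
```

Note that the function cannot be called without the `policy` table, which is exactly the restriction mentioned above.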

If you know that you are going to be estimating $V_{\pi}$ from the start, and have a fixed policy, then you could use basic TD learning instead of SARSA, where the update rule is:

$$V(s) \leftarrow V(s) + \alpha(r + \gamma V(s') - V(s))$$

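Doing this would allow you to estimate $V_{\pi}$ directly from observed transitions $(s, r, s')$ without knowing $\pi$ explicitly, provided those transitions were generated by following $\pi$.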
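A minimal sketch of this update in Python, assuming a tabular setting and episodes logged as lists of `(s, r, s_next, done)` tuples. The names `td0_update` and `evaluate`, the `done` flag, and the toy episode are assumptions made for the example. Treating the value of a terminal state as zero is also an assumption; it is the usual convention but is not stated in the update rule above.

```python
from collections import defaultdict


def td0_update(v, s, r, s_next, done, alpha=0.1, gamma=0.9):
    """Apply V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)) in place."""
    # Assumption: a terminal next state contributes no future value.
    target = r + (0.0 if done else gamma * v[s_next])
    v[s] += alpha * (target - v[s])


def evaluate(episodes, alpha=0.1, gamma=0.9):
    """Estimate V from logged episodes; the policy itself is never consulted."""
    v = defaultdict(float)  # unseen states start at 0
    for episode in episodes:
        for s, r, s_next, done in episode:
            td0_update(v, s, r, s_next, done, alpha, gamma)
    return v


# Hypothetical logged episode: A -> B -> terminal, with rewards 0 then 1.
episode = [("A", 0.0, "B", False), ("B", 1.0, "END", True)]
v = evaluate([episode] * 1000)
print(dict(v))
```

The actions taken never appear in the code: only states, rewards, and next states are used, which is why $\pi$ does not need to be known here.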

Neil Slater