4

I was going through Sutton's book, and for sample-based learning, where samples are used to estimate expectations, we have this update formula

$$ \text{new estimate} = \text{old estimate} + \alpha(\text{target} - \text{old estimate}) $$

What I don't quite understand is why it's called the target: since it's just a sample, it's not the actual target value, so why are we moving towards a wrong value?
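To make the formula concrete, here is a minimal sketch of how I read that update rule (the step size and the sampled returns below are made-up illustration values, not anything from the book):

```python
# Minimal sketch of the incremental update rule; alpha and the sampled
# returns are made-up illustration values.
def incremental_update(old_estimate, target, alpha=0.1):
    """Move the estimate a fraction alpha of the way toward the target."""
    return old_estimate + alpha * (target - old_estimate)

estimate = 0.0
for sampled_return in [4.2, 3.8, 5.1, 4.0]:   # hypothetical sampled returns
    estimate = incremental_update(estimate, target=sampled_return)
print(estimate)
```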

nbro

2 Answers

6

It is our "current" target. We assume that the value we get now is at least a closer approximation to the "true" target.

We're not so much moving towards a wrong value as we are moving away from a more wrong value.

Of course, it is all based on random trials, so anything definite (such as "we are guaranteed to improve at each step") is hard to show without arguing probabilistically. The expectation of the error of the value function (compared to the true value function) will decrease; that is all we can say.
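As an illustrative sketch (a toy example of my own, not from the book): repeatedly updating toward noisy sample targets drawn around a fixed true value means every individual target is "wrong", yet the estimate's error shrinks on average.

```python
# Toy illustration: each sampled target is noisy ("wrong"), but repeatedly
# moving the estimate toward these samples drives the estimate's error down.
import random

random.seed(0)
true_value = 10.0
estimate = 0.0          # start far from the truth (error = 10)
alpha = 0.1

for step in range(200):
    sampled_target = true_value + random.gauss(0, 2.0)   # a noisy sample
    estimate += alpha * (sampled_target - estimate)

print(abs(estimate - true_value))   # far smaller than the initial error of 10
```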

  • Maybe to clarify your answer, you could add the formula of the "target" and why it represents an estimate of the "true" target. Anyway, welcome to Artificial Intelligence SE :) – nbro Aug 28 '20 at 19:16
  • This isn't clear to me: why is it closer to the true target? – Chukwudi Ogbonna Aug 28 '20 at 20:02
  • I understand that we use a sample return as the target, but what I don't get is how it moves closer to the true target. The sample return doesn't get closer to the target; the mean of those samples is what gets closer, which is our value function. So why do we then call it the target? – Chukwudi Ogbonna Aug 28 '20 at 20:16
  • The new value is equal to $(1 - \alpha)(\text{old estimate}) + \alpha(\text{target})$. We're gradually "forgetting" the old estimate with a decay rate of $(1 - \alpha)$ (a numeric check is sketched below). – Robby Goetschalckx Aug 28 '20 at 23:13
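As a small numeric check of the decay described in the last comment (the step size, initial value, and targets below are arbitrary): with a constant $\alpha$, running the incremental update and computing the exponentially weighted average of past targets give the same number, with the initial value weighted by $(1-\alpha)^n$.

```python
# With constant alpha the old estimate is forgotten geometrically: the
# current estimate equals an exponentially weighted average of past targets
# plus (1 - alpha)^n times the initial value. Numbers are arbitrary.
alpha = 0.25
v0 = 5.0
targets = [1.0, 2.0, 3.0, 4.0]

# incremental form
v = v0
for g in targets:
    v = (1 - alpha) * v + alpha * g

# closed form: weights decay geometrically with recency
n = len(targets)
closed = (1 - alpha) ** n * v0 + sum(
    alpha * (1 - alpha) ** (n - 1 - i) * g for i, g in enumerate(targets)
)
print(v, closed)   # both 3.53125
```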
0

It would be helpful if you specified the section and page number of Sutton's book, but as far as I understand your question, I will try to explain it.

Think of the TD update. The sample contains $(s_t, a_t, r_{t+1}, s_{t+1})$. Writing the value estimate as an average of sampled targets and then in incremental form, we get

$$ v_{t}(s_t) = \frac{1}{t} \sum_{j=1}^{t}\left(r_{j+1} + \gamma v(s_{j+1})\right) $$
$$ v_{t}(s_t) = v_{t-1}(s_t) + \alpha \left(r_{t+1} + \gamma v_{t-1}(s_{t+1}) - v_{t-1}(s_t)\right) $$

We call $r_{t+1} + \gamma v_{t-1}(s_{t+1})$ the TD target. From the above equations you can already see that $r_{t+1} + \gamma v_{t-1}(s_{t+1})$ is used as an estimate of $v(s_t)$. We call it an unbiased estimate because $E[r_{t+1} + \gamma v_{t-1}(s_{t+1})] = v_\pi(s_t)$ once $v_{t-1}$ matches the true values; that is, the expectation of the target leads us to the true state-value function, $v_\pi(s)$.

The same explanation applies to the Monte Carlo update, where the target is the sampled return. I hope this answers your question.
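As a sketch of the two targets discussed here (the function and variable names are mine, not Sutton's pseudocode): a TD(0) update bootstraps from the current estimate of the next state, while a Monte Carlo update moves toward a sampled return.

```python
# Sketch of the two targets; V is a dict of state-value estimates.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    target = r + gamma * V[s_next]          # TD target, built from current estimates
    V[s] += alpha * (target - V[s])

def mc_update(V, s, G, alpha=0.1):
    """One Monte Carlo step: move V[s] toward the sampled return G observed from s."""
    V[s] += alpha * (G - V[s])

V = {"A": 0.0, "B": 0.0}
td0_update(V, s="A", r=1.0, s_next="B")     # V["A"] moves toward 1.0 + 0.9 * V["B"]
mc_update(V, s="B", G=2.5)                  # V["B"] moves toward the sampled return 2.5
print(V)
```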

Swakshar Deb
  • I understand the unbiased estimate part, which means that by the law of large numbers it'll eventually converge to the optimal value. What I don't get is why it's called the target: normally in supervised learning the target is the actual target value, not a sample return, yet here we're using a sample return as our TD target, which I do not understand – Chukwudi Ogbonna Aug 28 '20 at 19:59
  • In RL, targets are not fixed as they are in supervised learning; here we have moving targets. It is called a target in the sense that it is an unbiased estimate of the true value. – Swakshar Deb Aug 28 '20 at 20:08
  • Ok, so why do people say the target will eventually converge? Isn't it the mean of the targets that actually converges to the optimal Q value? I mean, that's what an unbiased estimate means – Chukwudi Ogbonna Aug 28 '20 at 20:20
  • People say this in the sense that, initially, what you get as $r_{t+1} + \gamma v(s_{t+1})$ is not actually an unbiased estimate. We initialize all $v(s)$ arbitrarily (say, to 0), and then they are gradually updated. Initially those values do not give an unbiased estimate, but as time passes the estimates get closer to, and eventually converge to, the unbiased estimate. For this reason people say the target will eventually converge to the unbiased estimate. – Swakshar Deb Aug 28 '20 at 20:34
  • I think I get what you mean: we use the estimate $r + \gamma V(s')$, while the unbiased estimate is the formula with the optimal $V(s)$, i.e. $r + \gamma V^*(s')$. But since we use an estimate instead of the actual value, as our $V(s)$ estimate converges to $V^*(s)$ we can then say the target becomes the unbiased estimate of our true value function? – Chukwudi Ogbonna Aug 28 '20 at 20:42
  • For TD learning or Monte Carlo learning we are not finding the optimal $v(s)$; we are just evaluating the policy $\pi$. – Swakshar Deb Aug 28 '20 at 20:48
  • What I mean is that the $V(s)$ we initially start with is wrong, so the target is not the unbiased estimate, but as our $V(s)$ converges to the true value, the target converges to the unbiased estimate – Chukwudi Ogbonna Aug 28 '20 at 20:51
  • Once the target converges, we can say the target is an unbiased estimate of the true value – Swakshar Deb Aug 28 '20 at 20:51
  • When the target converges to the true value, then you can get the true $v(s)$ for each state. – Swakshar Deb Aug 28 '20 at 20:53
  • That makes sense: because our target depends on a value function that is initially wrong, it isn't really the unbiased estimate, but as time goes on and we interact more, our value function becomes more accurate, and so our TD target converges to the unbiased estimate (see the sketch after these comments) – Chukwudi Ogbonna Aug 28 '20 at 20:54
  • Yes, you are right. – Swakshar Deb Aug 28 '20 at 20:55
  • Finally, thank you so much, I appreciate your time and effort – Chukwudi Ogbonna Aug 28 '20 at 21:10
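A small sketch of the point reached in these comments, on a made-up two-state chain (state A gives reward 0 and moves to B; state B gives a random reward and terminates; $\gamma = 0.9$, fixed policy): as the value estimates settle toward the true values, the TD targets built from them stop being systematically wrong.

```python
# Made-up two-state chain A -> B -> terminal, evaluated with TD(0).
# True values: V(B) = E[reward from B] = 0.5, V(A) = gamma * V(B) = 0.45.
import random

random.seed(0)
gamma, alpha = 0.9, 0.05
V = {"A": 0.0, "B": 0.0, "T": 0.0}          # T is terminal; its value stays 0
true_V = {"A": 0.9 * 0.5, "B": 0.5}

for episode in range(5000):
    # A -> B with reward 0: the TD target uses the current estimate of B
    target_A = 0.0 + gamma * V["B"]
    V["A"] += alpha * (target_A - V["A"])
    # B -> terminal with a random reward of 1 (probability 0.5) or 0
    r = 1.0 if random.random() < 0.5 else 0.0
    target_B = r + gamma * V["T"]
    V["B"] += alpha * (target_B - V["B"])

print(V["A"], true_V["A"])   # estimate close to 0.45
print(V["B"], true_V["B"])   # estimate close to 0.5
```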