
In per-decision importance sampling, as given in Sutton & Barto's book:

Eq 5.12 $\rho_{t:T-1}R_{t+k} = \frac{\pi(A_{t}|S_{t})}{b(A_{t}|S_{t})}\frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})}\frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})}\cdots\frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})}R_{t+k}$

Eq 5.13 $\mathbb{E}\left[\frac{\pi(A_{k}|S_{k})}{b(A_{k}|S_{k})}\right] = \displaystyle\sum_a b(a|S_k)\frac{\pi(a|S_{k})}{b(a|S_{k})} = \displaystyle\sum_a\pi(a|S_k) = 1$
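Eq 5.13 can be sanity-checked directly. A minimal sketch (the two-action policies below are made up purely for illustration) showing that the expectation of a single per-step ratio under $b$ is 1:

```python
# Check Eq 5.13: E_b[ pi(A|s) / b(A|s) ] = sum_a b(a|s) * pi(a|s) / b(a|s)
#                                        = sum_a pi(a|s) = 1.
# The two policies below are hypothetical, chosen only for illustration.
pi = {"left": 0.9, "right": 0.1}   # target policy pi(a|s)
b  = {"left": 0.5, "right": 0.5}   # behaviour policy b(a|s)

expected_ratio = sum(b[a] * pi[a] / b[a] for a in b)  # collapses to sum_a pi(a|s)
print(expected_ratio)
```

Any behaviour policy with full support over the actions gives the same result; this is why each such factor drops out of the product in expectation.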

Eq 5.14 $\mathbb{E}[\rho_{t:T-1}R_{t+k}] = \mathbb{E}[\rho_{t:t+k-1}R_{t+k}]$

As the full derivation is not given, how do we arrive at Eq 5.14 from Eq 5.12?

From what I understand:

1) $R_{t+k}$ is only dependent on the action taken at $t+k-1$, given the state at that time, i.e. only on $\frac{\pi(A_{t+k-1}|S_{t+k-1})}{b(A_{t+k-1}|S_{t+k-1})}$

2) $\frac{\pi(A_{k}|S_{k})}{b(A_{k}|S_{k})}$ is independent of $\frac{\pi(A_{k+1}|S_{k+1})}{b(A_{k+1}|S_{k+1})}$ , so $\mathbb{E}\left[\frac{\pi(A_{k}|S_{k})}{b(A_{k}|S_{k})}\frac{\pi(A_{k+1}|S_{k+1})}{b(A_{k+1}|S_{k+1})}\right] = \mathbb{E}\left[\frac{\pi(A_{k}|S_{k})}{b(A_{k}|S_{k})}\right]\mathbb{E}\left[\frac{\pi(A_{k+1}|S_{k+1})}{b(A_{k+1}|S_{k+1})}\right], \forall \, k\in [t,T-2]$

Hence, $\mathbb{E}[\rho_{t:T-1}R_{t+k}]= \mathbb{E}\left[\frac{\pi(A_{t}|S_{t})}{b(A_{t}|S_{t})}\frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})}\frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})}......\frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})}R_{t+k}\right] \\= \mathbb{E}\left[\frac{\pi(A_{t}|S_{t})}{b(A_{t}|S_{t})}\frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})}\frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})}....\frac{\pi(A_{t+k-2}|S_{t+k-2})}{b(A_{t+k-2}|S_{t+k-2})}\frac{\pi(A_{t+k}|S_{t+k})}{b(A_{t+k}|S_{t+k})}......\frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})}\right]\mathbb{E}\left[\frac{\pi(A_{t+k-1}|S_{t+k-1})}{b(A_{t+k-1}|S_{t+k-1})}R_{t+k}\right] \\= \mathbb{E}\left[\frac{\pi(A_{t}|S_{t})}{b(A_{t}|S_{t})}\right]\mathbb{E}\left[\frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})}\right]\mathbb{E}\left[\frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})}\right]....\mathbb{E}\left[\frac{\pi(A_{t+k-2}|S_{t+k-2})}{b(A_{t+k-2}|S_{t+k-2})}\right]\mathbb{E}\left[\frac{\pi(A_{t+k}|S_{t+k})}{b(A_{t+k}|S_{t+k})}\right]......\mathbb{E}\left[\frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})}\right]\mathbb{E}\left[\frac{\pi(A_{t+k-1}|S_{t+k-1})}{b(A_{t+k-1}|S_{t+k-1})}R_{t+k}\right] \\= \mathbb{E}[\frac{\pi_{t+k-1}}{b_{t+k-1}}R_{t+k}]\\=\mathbb{E}[\rho_{t+k-1}R_{t+k}]$

which is not equal to eq 5.14. What's the mistake in the above calculations? Are 1 and 2 correct?
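The discrepancy can also be checked numerically. The sketch below (a hypothetical 3-step MDP; all numbers invented) enumerates every trajectory exactly and compares three quantities: the full product $\mathbb{E}[\rho_{t:T-1}R_{t+k}]$, the per-decision form $\mathbb{E}[\rho_{t:t+k-1}R_{t+k}]$ from Eq 5.14, and the single-ratio form $\mathbb{E}[\rho_{t+k-1}R_{t+k}]$ derived above:

```python
from itertools import product

# Tiny hypothetical 3-step MDP with t=0, k=2, T=3: the episode is
# A_0 -> S_1 -> A_1 -> R_2 (then A_2, which yields no reward).
# All numbers are invented for illustration.
pi = {1: 0.7, 0: 0.3}                 # target policy (state-independent here)
b  = {1: 0.5, 0: 0.5}                 # behaviour policy
rho = {a: pi[a] / b[a] for a in (0, 1)}

def p_s1(s1, a0):                     # p(S_1 = s1 | s_0, a_0)
    return 0.8 if s1 == a0 else 0.2

def reward(s1, a1):                   # R_2, received after A_1 in state S_1
    return 1.0 if (s1 == 1 and a1 == 1) else 0.0

full = partial = last_only = 0.0
for a0, s1, a1, a2 in product((0, 1), repeat=4):
    prob = b[a0] * p_s1(s1, a0) * b[a1] * b[a2]   # S_2 is irrelevant to R_2
    r = reward(s1, a1)
    full      += prob * rho[a0] * rho[a1] * rho[a2] * r  # E[rho_{t:T-1} R_{t+k}]
    partial   += prob * rho[a0] * rho[a1] * r            # E[rho_{t:t+k-1} R_{t+k}]
    last_only += prob * rho[a1] * r                      # single-ratio claim

print(full, partial, last_only)
```

The first two values agree, consistent with Eq 5.14; the third differs, because $R_{t+k}$ is correlated with the earlier ratios through the state.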

ZERO NULLS
  • To answer your last question, in general, the expectation of the product of two random variables is not equal to the product of the expectations of each random variable, unless the random variables are independent. That doesn't seem to be true in your case (given that the policies at successive time steps could be correlated), but I could be wrong (because I know little or almost nothing about importance sampling and related topics). – nbro Jun 13 '20 at 16:50
  • I don't think policies at successive time steps are correlated, otherwise there would have been no point in giving Eq 5.13. Also, in the theory given in the book, $\pi$ is sometimes taken as epsilon-greedy, which is not correlated – ZERO NULLS Jun 13 '20 at 22:17
  • You made a mistake in step 2. State $S_{k+1}$ depends on $S_k$ and $A_k$ through the conditional probability $p(s'|s,a)$, so you can't do what you did. – Brale Jun 14 '20 at 08:06
  • $S_{k+1}$ is indeed dependent on $S_k$, but $A_k|S_k$ is independent of $A_{k+1}|S_{k+1}$, because $P(A_k|S_k)$ means: given $S_k = s$, what's the probability of choosing $A_k = a$? Similarly for $P(A_{k+1}|S_{k+1})$. Hence, since we are given $S_{k+1} = s'$, there is no need to calculate $p(s'|s,a)$. Is something wrong in this way of thinking? – ZERO NULLS Jun 14 '20 at 22:07

2 Answers


As mentioned in the comments, your assumption about independence is wrong. Here's why. To prove independence, we need to show that the following holds:

$P(X=x, Y=y) = P(X=x)P(Y=y)$

in the case of RL, taking $X = A_t$ and $Y = A_{t+1}$, this becomes:

$P(A_t=a, A_{t+1}=a') = P(A_t=a)P(A_{t+1}=a')$

The left-hand side has the value:

$P(A_t=a, A_{t+1}=a') = b(A_t=a|S_t=s)\,p(s'|s,a)\,b(A_{t+1}=a'|S_{t+1}=s')$

while the right-hand side has the value:

$P(A_t=a)P(A_{t+1}=a') = b(A_t=a|S_t=s)\,b(A_{t+1}=a'|S_{t+1}=s')$

The two sides differ by the transition factor $p(s'|s,a)$, and hence the actions are not independent.
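This can be checked with numbers. A tiny two-step sketch (all probabilities hypothetical) comparing the joint distribution of successive actions with the product of their marginals:

```python
# Successive actions are NOT independent under b, because A_t influences
# S_{t+1} through p(s'|s,a), and b(.|S_{t+1}) depends on S_{t+1}.
# All numbers below are invented for illustration.
b0 = {0: 0.5, 1: 0.5}                 # b(a|s0)
p  = {0: {"x": 0.9, "y": 0.1},        # p(s'|s0, a=0)
      1: {"x": 0.1, "y": 0.9}}        # p(s'|s0, a=1)
b1 = {"x": {0: 0.8, 1: 0.2},          # b(a'|s'=x)
      "y": {0: 0.3, 1: 0.7}}          # b(a'|s'=y)

def joint(a, a2):
    # P(A_t=a, A_{t+1}=a') = b(a|s0) * sum_{s'} p(s'|s0,a) * b(a'|s')
    return b0[a] * sum(p[a][s] * b1[s][a2] for s in ("x", "y"))

# marginals of each action
pA0 = {a: sum(joint(a, a2) for a2 in (0, 1)) for a in (0, 1)}
pA1 = {a2: sum(joint(a, a2) for a in (0, 1)) for a2 in (0, 1)}

print(joint(0, 0), pA0[0] * pA1[0])   # the two values differ
```

If the transition probabilities $p(s'|s_0,a)$ did not depend on $a$, the two printed values would coincide; it is exactly the transition factor that breaks independence.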

Now let us look at why the following expression holds:

Eq.5.14: $\mathbb{E}[\rho_{t:T-1}R_{t+k}] = \mathbb{E}[\rho_{t:t+k-1}R_{t+k}]$

I will not derive the exact expressions, but I hope you can follow the reasoning I provide. By the rules of probability, we know that joint probabilities sum to 1, i.e.:

$\displaystyle\sum_{X_1 \dots X_n} P(X_1=a_1, X_2=a_2, \dots, X_n=a_n) = 1$

I have already shown above that the trajectory is not independent. So $R_{t+k}$ will depend on the trajectory $S_{t:t+k-1}$, where $S_{t:t+k-1}$ is a particular trajectory. At the end of this trajectory we get a reward $R_{t+k}$, and thus $R_{t+k}$ is exclusively a function of $S_{t:t+k-1}$, i.e. $R_{t+k} = f(S_{t:t+k-1})$. The trajectory after this, $S_{t+k:T-1}$, is irrelevant, since it will always sum up to 1: once you have reached a particular state at time step $t+k-1$, you are now conditioning on that, $P(S_{t+k:T-1}|S_{t:t+k-1})$, and taking the expected value over all trajectories possible from there on, i.e. $\sum_{S_{t+k:T-1}} P(S_{t+k:T-1}|S_{t:t+k-1})$, which is 1 by the rules of probability. Thus, what you are really doing is:

$\displaystyle\sum_{S_{t:t+k-1}} P(S_{t:t+k-1})\, R_{t+k} \left( \sum_{S_{t+k:T-1}} P(S_{t+k:T-1}|S_{t:t+k-1}) \right)$

and hence the remaining trajectory has no contribution.

Another way of thinking about this: you are taking weighted trajectories up to time step $t+k-1$, weighted by the rewards $R_{t+k}$, and hence these do not sum up to 1. The rest of the trajectory after $t+k-1$ will sum up to 1.

I hope this qualitative description suffices. You can do the maths, but you must be careful with the notations and the assumptions you make.

Also all the equations are correct, I hope you can indirectly see it from my reasoning.

  • unable to understand this step: $P(X=a,Y=a)=b(A_t=a|S_t=s)p(s'|a,s)b(A_{t+1}=a|S_{t+1}=s)$; after solving it, the right-hand side of this equation gives $p(s',a',a|s)$, which is not equal to $p(a,a')$. – ZERO NULLS Sep 15 '20 at 06:55
  • @SAHAJGUPTA wait, did you get it? You marked the answer as accepted, or should I explain? –  Sep 15 '20 at 11:53
  • Yep, I think you are correct... I made a notational mistake... what I meant to write was $X=a|s, Y=a'|s'$. I think this makes sense now. –  Sep 15 '20 at 12:26

First Part

We can reduce variance in off-policy importance sampling, even in the absence of discounting ($\gamma = 1$). Notice that the off-policy estimators are made up of terms like

$\rho_{t:T-1}G_t = \rho_{t:T-1}\left(R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1}R_T\right)$

and consider the second term, imagining $\gamma = 1$:

$\rho_{t:T-1}R_{t+2} = \frac{\pi(A_t|S_t)\pi(A_{t+1}|S_{t+1})\cdots\pi(A_{T-1}|S_{T-1})}{b(A_t|S_t)b(A_{t+1}|S_{t+1})\cdots b(A_{T-1}|S_{T-1})}R_{t+2}$

In the above equation, the terms $\pi(A_t|S_t)$, $\pi(A_{t+1}|S_{t+1})$, and $R_{t+2}$ are correlated; all the other terms are independent of each other.

Notice the very important property of expectation: $\mathbb{E}[ab] = \mathbb{E}[a]\,\mathbb{E}[b]$ if $a$ and $b$ are independent random variables.

Now:

$\mathbb{E}\left[\frac{\pi(A_t|S_t)\pi(A_{t+1}|S_{t+1})\cdots\pi(A_{T-1}|S_{T-1})}{b(A_t|S_t)b(A_{t+1}|S_{t+1})\cdots b(A_{T-1}|S_{T-1})}R_{t+2}\right]$

$= \mathbb{E}\left[\frac{\pi(A_t|S_t)\pi(A_{t+1}|S_{t+1})}{b(A_t|S_t)b(A_{t+1}|S_{t+1})}R_{t+2}\right]\mathbb{E}\left[\frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})}\right]\cdots\mathbb{E}\left[\frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})}\right]$

$= \mathbb{E}\left[\frac{\pi(A_t|S_t)\pi(A_{t+1}|S_{t+1})}{b(A_t|S_t)b(A_{t+1}|S_{t+1})}R_{t+2}\right]\sum_a b(a|s_{t+2})\frac{\pi(a|s_{t+2})}{b(a|s_{t+2})}\cdots\sum_a b(a|s_{T-1})\frac{\pi(a|s_{T-1})}{b(a|s_{T-1})}$

$= \mathbb{E}\left[\frac{\pi(A_t|S_t)\pi(A_{t+1}|S_{t+1})}{b(A_t|S_t)b(A_{t+1}|S_{t+1})}R_{t+2}\right]\sum_a\pi(a|s_{t+2})\cdots\sum_a\pi(a|s_{T-1})$

$= \mathbb{E}\left[\frac{\pi(A_t|S_t)\pi(A_{t+1}|S_{t+1})}{b(A_t|S_t)b(A_{t+1}|S_{t+1})}R_{t+2}\right]\cdot 1\cdots 1$

$= \mathbb{E}\left[\frac{\pi(A_t|S_t)\pi(A_{t+1}|S_{t+1})}{b(A_t|S_t)b(A_{t+1}|S_{t+1})}R_{t+2}\right]$

Therefore $\mathbb{E}[\rho_{t:T-1}R_{t+2}] = \mathbb{E}[\rho_{t:t+1}R_{t+2}]$.

If we repeat this analysis for the $k$th term, we will get:

$\mathbb{E}[\rho_{t:T-1}R_{t+k}] = \mathbb{E}[\rho_{t:t+k-1}R_{t+k}]$

It follows that the expectation of our original term can be written:

$\mathbb{E}[\rho_{t:T-1}G_t] = \mathbb{E}[\tilde{G}_t]$

where

$\tilde{G}_t \doteq \rho_{t:t}R_{t+1} + \gamma\rho_{t:t+1}R_{t+2} + \gamma^2\rho_{t:t+2}R_{t+3} + \cdots + \gamma^{T-t-1}\rho_{t:T-1}R_T$

We call this idea per-reward importance sampling. It follows immediately that there is an alternative importance-sampling estimate, with the same unbiased expectation as the ordinary importance-sampling estimate:

$V(s) \doteq \frac{\sum_{t\in\mathcal{T}(s)}\tilde{G}_t}{|\mathcal{T}(s)|}$
which we might expect to sometimes be of lower variance.
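As a small sketch of how $\tilde{G}_t$ would be computed from a single episode (the per-step ratios and rewards below are hypothetical), each reward $R_{t+k+1}$ is weighted only by the product of ratios up to its own time step:

```python
# Compute the per-reward weighted return
#   G~_t = rho_{t:t} R_{t+1} + gamma * rho_{t:t+1} R_{t+2} + ...
# from hypothetical trajectory data (ratios pi/b and rewards, invented).
gamma   = 0.9
rhos    = [1.4, 0.6, 1.4]      # per-step ratios pi(A|S)/b(A|S) from t onward
rewards = [1.0, 0.0, 2.0]      # R_{t+1}, R_{t+2}, R_{t+3}

g_tilde, cum_rho = 0.0, 1.0
for k, (rho, r) in enumerate(zip(rhos, rewards)):
    cum_rho *= rho             # running product rho_{t:t+k}
    g_tilde += (gamma ** k) * cum_rho * r
print(g_tilde)
```

In contrast, ordinary importance sampling would multiply every reward by the full product over all three ratios, which inflates the variance without changing the expectation.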

Second Part

The reward $R_{k+1}$ depends on the previous terms $\pi(a_1|s_1)$ up to $\pi(a_k|s_k)$. So you can't separate them and treat them as independent variables, as was done in the example above.

Swakshar Deb