> Shouldn't the expected return be calculated for some faraway time in the future (e.g. $t+n$) instead of the current time $t$?
This is partly a notation issue, but $G_t$ is already the future sum of rewards, as shown by the first (and correct) equation in your question. You don't actually know the value of any individual return $g_t$* until after time $t+n$. However, you can predict its expected value $\mathbb{E}[G_t]$, provided the environment and policy remain consistent.
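As a minimal illustration (the reward list, the discount factor `gamma` and the function name are assumptions for the sketch, not taken from your question), this is how a measured return $g_t$ would be computed once the rewards after time step $t$ have actually been observed:

```python
# Sketch: computing a *measured* return g_t in hindsight, after the
# rewards following time step t have been observed. gamma and the
# reward list are illustrative assumptions.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]   # r_{t+1}, r_{t+2}, ... for one episode

def sampled_return(rewards_after_t, gamma=gamma):
    """g_t = sum_k gamma^k * r_{t+k+1}, only computable once those
    rewards have been collected."""
    return sum(gamma ** k * r for k, r in enumerate(rewards_after_t))

g_0 = sampled_return(rewards)       # return seen from the episode start
g_2 = sampled_return(rewards[2:])   # return seen from time step t = 2
```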
Your second equation is pretty similar to the first one, which is why I say this is partly a notation issue. However, the point of calculating the forward-looking expectation of the return is to allow an assessment at time $t$: either to predict the likely outcome in terms of future rewards, knowing the system is in state $s_t$, or, in control scenarios, to choose an action $a_t$.
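For concreteness, these two use cases correspond to the usual state-value and action-value functions (standard definitions, not quoted from your question):

$$v_\pi(s_t) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s_t \,\right], \qquad q_\pi(s_t, a_t) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s_t, A_t = a_t \,\right]$$

Both expectations are conditioned only on what is known at time $t$, and both look forward from $t$.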
In addition, in both prediction and control scenarios, the past (time steps $0$ to $t-1$) has already happened. Measurements of how well a system did at gathering reward, looking backwards, might be useful metrics, e.g. for answering "how good is this agent?". However, they are not in general a guide to the future. In many sparse-reward environments (e.g. a board game scoring +1 for a win), that data is essentially useless for predicting the future; everything you need to know is summarised by the current state and the acting policy.
Regardless of when you finally get to calculate a return, the start time step, i.e. the point in the trajectory from which the return is calculated, is a key parameter. The end time step is a practical concern for implementation, but can often be taken to be at infinity for theoretical purposes (i.e. we are interested in measuring or optimising all future rewards). So if you are only going to show one parameter in the notation, the start time $t$ is the one to use.
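Written out with an explicit discount factor $\gamma$ (assuming the usual discounted formulation), the only free time index in the return is the start time $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$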
There are variants of the notation, used to show how the return is calculated, where the calculation horizon is made explicit, e.g. $G_{t:t+n}$ for a truncated return or $G_{t:t+1}$ for a one-step temporal difference target. All the ones I have seen still maintain the forward view that associates the value with the current time step $t$, for the same reason as explained above: it is at time step $t$ that this value is of most interest as a prediction.
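For example, one common definition (following Sutton & Barto's notation) bootstraps the truncated $n$-step return from a value estimate at the horizon, and the one-step case reduces to the familiar TD target:

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}), \qquad G_{t:t+1} = R_{t+1} + \gamma \hat{v}(S_{t+1})$$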
In practice, during training, you often wait until $t+n$ before you know the correct value of $g_t$* to apply as a training target, which is then used to update the value estimates $\hat{v}(s_t)$ or $\hat{q}(s_t, a_t)$. It is possible to make partial updates before that end time step using techniques such as eligibility traces.
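A minimal tabular sketch of that kind of delayed update is below. The names `v_hat`, `alpha`, `gamma` and the bootstrapped $n$-step target are assumptions for illustration, not a description of any particular library:

```python
from collections import defaultdict

gamma = 0.99   # discount factor (assumed)
alpha = 0.1    # learning rate (assumed)

v_hat = defaultdict(float)   # tabular estimate of v(s), default 0.0

def n_step_target(rewards, bootstrap_value):
    """Truncated return used as the training value for time step t:
    the n observed rewards, discounted, plus a discounted bootstrap
    from the estimated value of the state reached at t + n."""
    g = bootstrap_value
    for r in reversed(rewards):   # fold backwards: r + gamma * (rest)
        g = r + gamma * g
    return g

def update_value(state_t, rewards_t1_to_tn, state_tn):
    """Called at time t + n, once the intervening rewards are known."""
    g_t = n_step_target(rewards_t1_to_tn, v_hat[state_tn])
    v_hat[state_t] += alpha * (g_t - v_hat[state_t])
```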
* Using the notation of uppercase $G$ for the random variable and lowercase $g$ for a measured value.