
I know that

$$\mathbb{E}[g(X) \mid A] = \sum\limits_{x} g(x) p_{X \mid A}(x)$$

for any discrete random variable $X$ and event $A$.
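For example (just a toy case to make the notation concrete): if $X$ is a fair six-sided die roll, $g(x) = x^2$, and $A$ is the event that the roll is even, then

$$\mathbb{E}[X^2 \mid A] = \sum\limits_{x \in \{2, 4, 6\}} x^2 \cdot \frac{1}{3} = \frac{4 + 16 + 36}{3} = \frac{56}{3}.$$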

Now, consider the following expression.

$$\mathbb{E}_{\pi} \left[ \sum \limits_{k=0}^{\infty} \gamma^{k}r_{t+k+1} \mid s_t = s, a_t = a \right]$$

It is used in the definition of Q values.
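Written out explicitly, I believe this is the definition of the action-value function (as in Sutton & Barto):

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi} \left[ \sum\limits_{k=0}^{\infty} \gamma^{k}r_{t+k+1} \mid s_t = s, a_t = a \right]$$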

I can understand the following

  1. $A$ is $\{s_t = s, a_t = a\}$, i.e., the agent has performed action $a$ in state $s$ at time step $t$, and

  2. $g(X)$ is $\sum\limits_{k=0}^{\infty} \gamma^{k}r_{t+k+1}$, i.e., the return (long-run reward).

What I don't understand is what $X$ is here, i.e., what is the random variable over which we are calculating the long-run reward?

My guess is the policy function, i.e., that the expression is averaging long-run rewards over all possible policy functions. Is that true?

hanugm

1 Answer


I am using the convention of uppercase $X$ for random variable and lowercase $x$ for an individual observation. It is possible your source material did not do this, which might be causing your confusion. However, it is the convention used in Sutton & Barto's Reinforcement Learning: An Introduction.

What I don't understand is what $X$ is here, i.e., what is the random variable over which we are calculating the long-run reward?

The random variable is $R_t$, the reward at each time step. The distribution of $R_t$ in turn depends on the distribution of $S_{t-1}$ and $A_{t-1}$, plus the policy and the state progression rules. There is no need to include the process that causes the distribution of each $R_t$ in every equation, although sometimes it is useful to do so, for example when deriving the Bellman equations for value functions.
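For reference, this is the kind of breakdown I mean: one standard form of the Bellman equation for the action-value function, written with the $p(s', r \mid s, a)$ notation from Sutton & Barto:

$$q_{\pi}(s, a) = \sum\limits_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum\limits_{a'} \pi(a' \mid s') \, q_{\pi}(s', a') \right]$$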

My guess is the policy function, i.e., that the expression is averaging long-run rewards over all possible policy functions. Is that true?

No, this is not true. In fact, the usual assumption is that the policy function $\pi(a|s)$ remains fixed over the expectation, and this is what the subscript $\pi$ in $\mathbb{E}_{\pi}[...]$ means.

The expectation is over randomness due to the policy $\pi$, plus randomness due to the environment, which can be described by the function $p(r, s'|s, a)$ - the probability of observing reward $r$ and next state $s'$ given starting in state $s$ and taking action $a$. These two functions combine to create the distribution of $R_t$. It is possible that both functions are deterministic in practice, thus $R_t$ is also deterministic. However, RL theory works on the more general stochastic case, which is also used to model exploratory actions, even if the target policy and environment are deterministic.
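To make that concrete, here is a minimal sketch (a made-up two-state MDP and a made-up stochastic policy, purely for illustration): sampling many returns under $\pi$ and $p(r, s' \mid s, a)$ and averaging them gives a Monte Carlo estimate of the expectation above.

```python
import random

# A tiny, made-up MDP used only to illustrate what E_pi[...] averages over.
# All names (states, actions, rewards, probabilities) are invented for this sketch.
GAMMA = 0.9

# Environment dynamics p(r, s' | s, a):
# for each (state, action), a list of (probability, reward, next_state).
TRANSITIONS = {
    ("s0", "a0"): [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
    ("s0", "a1"): [(1.0, 0.5, "s1")],
    ("s1", "a0"): [(1.0, 0.0, None)],            # None marks a terminal state
    ("s1", "a1"): [(0.5, 2.0, None), (0.5, 0.0, "s0")],
}

# A stochastic policy pi(a | s): for each state, a list of (probability, action).
POLICY = {
    "s0": [(0.6, "a0"), (0.4, "a1")],
    "s1": [(0.3, "a0"), (0.7, "a1")],
}


def draw(options):
    """Sample one outcome from a list of (probability, *outcome) tuples."""
    u = random.random()
    cumulative = 0.0
    for option in options:
        cumulative += option[0]
        if u <= cumulative:
            return option[1:]
    return options[-1][1:]


def sample_return(state, action):
    """Sample one return G_t = sum_k gamma^k * r_{t+k+1}, starting from (s, a)."""
    g, discount = 0.0, 1.0
    while True:
        reward, next_state = draw(TRANSITIONS[(state, action)])
        g += discount * reward
        discount *= GAMMA
        if next_state is None:
            return g
        state = next_state
        (action,) = draw(POLICY[state])   # randomness due to the policy pi


def estimate_q(state, action, n=100_000):
    """Monte Carlo estimate of Q(s, a): the sample mean of the sampled returns."""
    return sum(sample_return(state, action) for _ in range(n)) / n


if __name__ == "__main__":
    print("Q(s0, a0) is approximately", estimate_q("s0", "a0"))
```

Each sampled return differs because of both the random action choices (the policy) and the random rewards and transitions (the environment); the Q value is the mean of that whole distribution of returns.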

Neil Slater
  • To me, it's not fully clear what you mean by this sentence "_There is no need to confuse the process that causes the distribution of each $R_t$ (which can be broken down analytically, and this is done to derive the Bellman equations), with the fact that you are measuring $R_t$ directly in the expectation._" – nbro Oct 18 '20 at 10:10
  • @nbro I've had an attempt to reword that paragraph. – Neil Slater Oct 18 '20 at 11:24
  • If $R$ is the random variable, then all the possible values of $R$ (i.e., $r_{t+1}, r_{t+2}, \ldots$) are used in the summation. Then over which values is the expectation calculated? I tried a lot, but I can't figure out where I am going wrong. Is it over all possible reward functions? – hanugm Oct 19 '20 at 07:51
  • @hanugm If $R_t$ is a random variable, it has a _distribution_. The expectation is calculated over that distribution. The precise distribution depends on the policy being used, plus the state progression and reward function from the environment. Typically in RL you don't work with the full distribution of the return ($G_t$, or sum of rewards), although that may be possible for very simple environments. Instead you work either with samples from the distribution (e.g. Monte Carlo methods), or the Bellman equation (e.g. Value Iteration), or both (e.g. TD learning). – Neil Slater Oct 19 '20 at 08:05
  • So, if I have a fixed immediate reward function and policy function, then there is no need for the expectation. Is that right? – hanugm Oct 26 '20 at 03:02
  • @hanugm: You would need a fixed *deterministic* policy, a *deterministic* reward function and a *deterministic* state progression. Then in theory all sampled results would exactly equal the expectation. That is not a usual setup for RL, because you also want to explore and typically that is done by having a stochastic policy. – Neil Slater Oct 26 '20 at 08:44
  • @hanugm: The equations with the expectation are used to work out RL theory. They are directly relevant for methods like Value Iteration. Other than that, unless you are working on the theory (studying existing approaches, or devising new ones), you don't need to work with it directly. – Neil Slater Oct 26 '20 at 08:49