
In David Silver's 8th lecture, he talks about model learning and says that learning $r$ from $s,a$ is a regression problem, whereas learning $s'$ from $s,a$ is a kernel density estimation problem. His explanation for the difference is that, in a stochastic environment, from the tuple $s,a$ there might be a 30% chance the wind blows me left and a 70% chance the wind blows me right, so we want to estimate these probabilities.

Is the main difference between these two problems (and hence why one is regression and the other is kernel density estimation) that with the reward we are mainly concerned with the expected reward (hence regression), whereas with the state transitions we want to be able to simulate the environment, so we need the estimated density?
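
To make this concrete, here is a minimal sketch of the two learning targets for a single $s,a$ pair (my own toy example with made-up numbers, not something from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up experience for one fixed (s, a) pair in a windy environment:
# the wind blows us to state 1 ("left") 30% of the time and to
# state 2 ("right") 70% of the time; the reward is noisy around -1.
next_states = rng.choice([1, 2], size=1000, p=[0.3, 0.7])
rewards = -1.0 + 0.1 * rng.standard_normal(1000)

# Learning r(s, a) is regression: the least-squares fit to the
# observed rewards is just their sample mean.
r_hat = rewards.mean()

# Learning p(s' | s, a) is density estimation: we want the whole
# distribution over next states, not a single summary number.
values, counts = np.unique(next_states, return_counts=True)
p_hat = dict(zip(values, counts / counts.sum()))

print(r_hat)  # ~ -1.0
print(p_hat)  # ~ {1: 0.3, 2: 0.7}
```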

David

1 Answer


Is the main difference between these two problems (and hence why one is regression and the other is kernel density estimation) that with the reward we are mainly concerned with the expected reward (hence regression), whereas with the state transitions we want to be able to simulate the environment, so we need the estimated density?

Yes.

An expected reward function of $s,a$ is all you need to construct valid Bellman equations for value functions. For example,

$$q_{\pi}(s,a) = r(s,a) + \gamma\sum_{s'}p(s'|s,a)\sum_{a'}\pi(a'|s')q_{\pi}(s',a')$$

is a valid way of writing the Bellman equation for action values. You can derive this from $r(s,a) = \sum_{r,s'}r\,p(r,s'|s,a)$ and $q_{\pi}(s,a) = \sum_{r,s'}p(r,s'|s,a)\left(r + \gamma\sum_{a'}\pi(a'|s')q_{\pi}(s',a')\right)$ if you have the equations in that form.
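
As a minimal sketch of that point (with made-up values for $r$, $p$, $\pi$ and $\gamma$, and plain fixed-point iteration as my own choice of solver), the Bellman equation above can be evaluated using only the expected reward function:

```python
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9

# r[s, a]     : expected immediate reward (made-up numbers)
# p[s, a, s'] : transition probabilities (uniform here for simplicity)
# pi[s, a]    : policy probabilities (uniform random policy)
r = np.array([[0.0, 1.0],
              [0.5, 0.0],
              [1.0, 2.0]])
p = np.full((n_states, n_actions, n_states), 1.0 / n_states)
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# Iterate q(s,a) <- r(s,a) + gamma * sum_s' p(s'|s,a) sum_a' pi(a'|s') q(s',a')
q = np.zeros((n_states, n_actions))
for _ in range(500):
    v = (pi * q).sum(axis=1)   # v(s') = sum_a' pi(a'|s') q(s', a')
    q = r + gamma * p @ v      # the expected reward is all we need

print(q)
```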

However, in general there is no such thing as an "expected state" when there is more than one possible outcome (i.e. in environments with stochastic state transitions). You can take a mean of the state vector representations over the samples you see for $s'$ but that is not the same thing at all and could easily be a representation of an unreachable/nonsense state.
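
A quick numerical illustration of that, using the 30%/70% wind example from the question with a one-dimensional position as the state representation:

```python
import numpy as np

rng = np.random.default_rng(0)

# The wind blows us to position -1 with probability 0.3 and to
# position +1 with probability 0.7; both are valid next states.
next_positions = rng.choice([-1.0, 1.0], size=10_000, p=[0.3, 0.7])

# The mean of the sampled next states is ~0.4, a position the agent
# can never actually occupy -- an unreachable "expected state".
print(next_positions.mean())
```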

In some cases the expectation $\mathbb{E}_{\pi}[x(S_{t+1})|S_t=s, A_t=a]$, where $x: \mathcal{S} \rightarrow \mathbb{R}^d$ maps any given state $s$ to a feature vector, can be meaningful. The broadest and most trivial example of this is for deterministic environments. You may be able to construct stochastic environments where there is a good interpretation of such a vector, even if it does not represent any reachable state.

Simple one-hot encoded states could maybe be made to work like this by representing a probability distribution over states (this would also require re-interpreting the expected reward function and the value functions). That is effectively a kernel density function over a discrete state space.
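
As a quick check of that interpretation (again with made-up probabilities), the mean of one-hot next-state vectors is exactly the empirical distribution over next states:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 4

# Sample next-state indices for one fixed (s, a): state 1 with
# probability 0.3 and state 2 with probability 0.7 (made-up numbers).
idx = rng.choice(n_states, size=10_000, p=[0.0, 0.3, 0.7, 0.0])
one_hot = np.eye(n_states)[idx]  # x(s') as one-hot feature vectors

# The mean one-hot vector is the empirical distribution p(s'|s,a),
# i.e. a density estimate over the discrete state space.
print(one_hot.mean(axis=0))  # ~ [0.0, 0.3, 0.7, 0.0]
```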

In general, knowing this expected value $\mathbb{E}_{\pi}[x(S_{t+1})|S_t=s, A_t=a]$ does not help resolve future rewards, as they can depend arbitrarily on the specific state transitions.

Neil Slater
  • Thanks - I understood why you wouldn't want to take the expected value of the state distribution; I just wanted to double check why we aren't interested in the full distribution of the rewards, just the expected value. – David May 29 '20 at 12:24
  • @DavidIreland: I probably focussed too much on the state distribution issue and not enough on expected reward then. I may come back and address that – Neil Slater May 29 '20 at 12:30
  • For me, the comment you made about only needing the expected reward to construct the Bellman equations was enough, but the new edit is very clear. – David May 29 '20 at 12:42