
I was wondering how one would normalize observations for a policy without knowing the upper and lower limits of the environment's values. A trivial technique would be to normalize each observation by its maximum value before feeding it into the policy. However, I feel that doing so could change the distribution of the data. For instance, say we have two inputs that we normalize on the fly: [5, 15, 20] and [10, 20, 30] become [0.25, 0.75, 1] and [0.33, 0.67, 1]. Now suppose the true maximum value of the environment is 100. Then the true normalized values should have been [0.05, 0.15, 0.20] and [0.10, 0.20, 0.30]. Wouldn't this adversely affect learning? A minimal sketch of what I mean is below.
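For concreteness, here is a small Python sketch of the per-observation max-scaling I described (the helper name `normalize_by_max` is just illustrative, not from any library):

```python
import numpy as np

def normalize_by_max(obs):
    # Scale an observation vector by its own largest component.
    obs = np.asarray(obs, dtype=float)
    return obs / obs.max()

print(normalize_by_max([5, 15, 20]))   # [0.25 0.75 1.  ]
print(normalize_by_max([10, 20, 30]))  # [0.3333... 0.6667... 1.]
# If the environment's true maximum were known to be 100, these would
# instead be [0.05 0.15 0.2] and [0.1 0.2 0.3]: the same raw value
# maps to a different normalized value depending on the observation.
```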

desert_ranger

1 Answer


Alternatively, you can compute a running mean, $\mu_t$, and a running sum of squared deviations, $M_t$, of your online data $x_t$, and then standardize at each timestep $t$: $$\begin{align} \mu_t &\leftarrow \mu_{t-1} + \frac{x_t - \mu_{t-1}}{n}\\ M_t &\leftarrow M_{t-1} + (x_t - \mu_{t-1})(x_t - \mu_t)\\ n &\leftarrow n + 1 \end{align}$$ The standard deviation is then $\sigma_t = \sqrt{M_t / (n - 1)}$, and the standardized observation is $(x_t - \mu_t) / \sigma_t$.

Initially you would have: $\mu_0 = 0$, $M_0 = 0$, and $n = 1$ (in practice, clamp $\sigma_t$ to at least some small $\epsilon$ so the first few standardizations don't divide by zero).

For reference, these recurrences are known as Welford's online algorithm for computing a running mean and variance.
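Here is a minimal Python sketch of the recurrences above for a scalar stream; the class name `RunningNormalizer` and the `eps` guard are illustrative additions, not part of the formulas:

```python
import math

class RunningNormalizer:
    """Standardize a stream of scalars with the running statistics above
    (a minimal sketch; names here are illustrative)."""

    def __init__(self, eps=1e-8):
        self.mu = 0.0   # running mean, mu_0 = 0
        self.m = 0.0    # running sum of squared deviations, M_0 = 0
        self.n = 1      # counter, starts at 1 as above
        self.eps = eps  # keeps the first divisions away from zero

    def update(self, x):
        delta = x - self.mu
        self.mu += delta / self.n        # mu_t <- mu_{t-1} + (x_t - mu_{t-1}) / n
        self.m += delta * (x - self.mu)  # M_t <- M_{t-1} + (x_t - mu_{t-1})(x_t - mu_t)
        self.n += 1                      # n <- n + 1

    def standardize(self, x):
        # After k samples, n == k + 1, so M / (n - 1) is the population variance.
        sigma = math.sqrt(self.m / (self.n - 1))
        return (x - self.mu) / (sigma + self.eps)

norm = RunningNormalizer()
for x in [5.0, 15.0, 20.0, 10.0, 20.0, 30.0]:
    norm.update(x)
    print(norm.standardize(x))
```

For vector observations you would keep per-dimension statistics, e.g. by replacing the scalars with NumPy arrays.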

Luca Anzalone