
In the maximum entropy inverse reinforcement learning paper, Ziebart et al. show that the state visitation frequency $\rho(s)$ of a state $s$ can be computed as $$ \rho_{\pi}(s) = \sum_{t=1}^{T} P(s_t=s|\pi), $$ which is the sum, over all time steps, of the probability that state $s$ is visited at time $t$.

I just don't understand why it is a sum. From my perspective, a frequency should be less than one, so it should be the average value $$ \rho_{\pi}(s) = \frac{1}{T}\sum_{t=1}^{T} P(s_t=s|\pi). $$
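For concreteness, here is a minimal sketch of the two quantities side by side; the 3-state chain, the policy-induced transition matrix, the initial distribution, and the horizon are all made up for illustration and do not come from the paper:

```python
import numpy as np

# Hypothetical policy-induced transition matrix P_pi[s, s'] = P(s_{t+1}=s' | s_t=s, pi)
# and initial state distribution mu0; both are made-up illustration values.
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.2],
                 [0.1, 0.0, 0.9]])
mu0 = np.array([1.0, 0.0, 0.0])
T = 50

d_t = mu0.copy()           # P(s_0 = s | pi)
rho_sum = np.zeros(3)      # accumulates sum_t P(s_t = s | pi)
for _ in range(T):
    rho_sum += d_t         # add this step's state probabilities
    d_t = d_t @ P_pi       # propagate one step to P(s_{t+1} = s | pi)

print(rho_sum)        # summed version: expected number of visits, sums to T
print(rho_sum / T)    # averaged version: a proper distribution, sums to 1
```

The two only differ by the constant factor $1/T$, which is what the question is about.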

skypitcher
  • My feeling is that they define the first equation and will then normalise it to make it a state distribution. – David Apr 26 '21 at 08:34
  • It should be the average, and this is rarely mentioned except in an IRL summer camp at UCB. You can check this [GithubIssue](https://github.com/yrlu/irl-imitation/issues/1#issuecomment-532552252) for details. – skypitcher Apr 28 '21 at 08:08

1 Answer


The equation you show does not appear in Ziebart et al. (2008). They do, however, provide a description of the computation in Algorithm 1.

It is a visitation frequency, i.e. an expected number of visits, not a probability distribution, so it does not need to be averaged.

If you look at Equation 2 in Arora & Doshi (2020), you find a formulation that describes Algorithm 1 quite well:

$\phi^\pi(s) = \phi^0(s) + \sum_{s'\in\mathcal{S}}P(s,\pi(s),s')\phi^\pi(s')$.

I am not entirely satisfied with this formulation because, in my opinion, there should also be a summation over $a\in \mathcal{A}$, as in the definition of $\eta(s)$, the expected number of visits, in Equation 9.2 of Sutton & Barto (2020):

$\eta(s)=h(s)+\sum_{\bar{s}}\eta(\bar{s})\sum_a\pi(a|\bar{s})p(s|\bar{s},a)$.
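As a rough sketch of how this recursion connects back to the sum in the question: starting from $\eta = 0$, the $k$-th sweep of the update yields the $k$-step truncation of $\sum_t P(s_t=s|\pi)$, so after $T$ sweeps you get exactly the finite-horizon sum. The dynamics, policy, and horizon below are hypothetical values chosen only for illustration, not taken from any of the cited papers:

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# p[s_bar, a, s] = p(s | s_bar, a); rows normalized over the last axis.
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=-1, keepdims=True)

# pi[s_bar, a] = pi(a | s_bar), a stochastic policy.
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=-1, keepdims=True)

h = np.array([1.0, 0.0, 0.0])   # initial state distribution h(s)

# Policy-induced state-to-state matrix: P_pi[s_bar, s] = sum_a pi(a|s_bar) p(s|s_bar, a)
P_pi = np.einsum("ba,bas->bs", pi, p)

# T sweeps of eta(s) = h(s) + sum_{s_bar} eta(s_bar) sum_a pi(a|s_bar) p(s|s_bar, a),
# starting from eta = 0, reproduce the T-step sum of visitation probabilities.
T = 50
eta = np.zeros(n_states)
for _ in range(T):
    eta = h + eta @ P_pi

print(eta)       # expected number of visits to each state (sums to T)
print(eta / T)   # normalized version from the question (sums to 1)
```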

To summarize, in all three descriptions, you just calculate how often a state is visited by policy $\pi$.