
In Section 2 of the paper *Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review*, the author discusses formulating the RL problem as a probabilistic graphical model. They introduce a binary optimality variable $\mathcal{O}_t$ that denotes whether time step $t$ was optimal (1 if so, 0 otherwise). They then define the probability that this random variable equals 1 as

$$\mathbb{P}(\mathcal{O}_t = 1 | s_t, a_t) = \exp(r(s_t, a_t)) \; .$$

My question is: why do they do this? The paper makes no assumptions about the value of the rewards (e.g. bounding them to be non-positive), so in principle a reward can take any value and the RHS can be larger than 1, which is obviously invalid for a probability. It would make sense if there were some normalising constant, or if the author said that the probability is only proportional to this, but they don't.
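To make the issue concrete, here is a minimal sketch (NumPy, with made-up reward values) showing that the right-hand side exceeds 1 as soon as any reward is positive:

```python
import numpy as np

# Hypothetical reward values chosen only for illustration; the paper
# places no upper bound on r(s_t, a_t).
rewards = np.array([-1.0, 0.0, 2.0, 5.0])

# The paper's definition: P(O_t = 1 | s_t, a_t) = exp(r(s_t, a_t))
probs = np.exp(rewards)
print(probs)  # exp(2) ~ 7.39 and exp(5) ~ 148.4, both larger than 1
```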

I have searched online and nobody seems to have asked this question, which makes me feel like I am missing something quite obvious, so I would appreciate it if somebody could clear this up for me.

David
  • Some papers that I've come across assume that the reward is normalized to lie in a certain range. I'm not sure whether the author is _implicitly_ assuming that here or not. Alternatively, he may be using $\operatorname{exp}$ to denote any exponential-like function with codomain $(0, 1]$, for instance the sigmoid, but that seems a bit uncommon, given that $\operatorname{exp}$ is typically used to denote the exponential with base $e$. My last guess is that the author was careless. I didn't really read the paper and these are just guesses. You could try to e-mail him. – nbro Dec 19 '20 at 18:56
  • 1
    @nbro I'd noticed it in another paper I'd read briefly, but it wasn't really relevant for my research, so I can't remember whether they made any assumptions on the range of the rewards. I think I'll try emailing the author and maybe answer this myself. – David Dec 19 '20 at 19:45

1 Answer


After doing some further reading, it turns out that non-positive rewards are the assumption needed for this to be a valid distribution: if $r(s_t, a_t) \le 0$, then $\exp(r(s_t, a_t)) \le 1$. However, the author notes that as long as the rewards are bounded above (you never receive a reward of infinity), you can re-scale them by subtracting the maximum possible reward $r_{\max}$, so that $r(s_t, a_t) - r_{\max} \le 0$ for every state-action pair. This shift only multiplies each $\exp(r(s_t, a_t))$ by the constant factor $e^{-r_{\max}}$, which cancels when the trajectory distribution is normalised, so it does not change the resulting inference.
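As a minimal sketch of this re-scaling (NumPy, with made-up reward values), subtracting the maximum reward makes every exponentiated value a valid probability:

```python
import numpy as np

# Hypothetical rewards, assumed bounded above (no reward of +infinity).
rewards = np.array([-1.0, 0.0, 2.0, 5.0])
r_max = rewards.max()  # maximum attainable reward

# Shift every reward so that it is non-positive, then exponentiate.
shifted = rewards - r_max   # all entries <= 0
probs = np.exp(shifted)     # all entries now lie in (0, 1]
print(probs)
```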

David