In Section 2 of the paper *Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review*, the author discusses formulating the RL problem as a probabilistic graphical model. They introduce a binary optimality variable $\mathcal{O}_t$ that denotes whether time step $t$ was optimal ($1$ if so, $0$ otherwise). They then define the probability that this random variable equals $1$ to be
$$\mathbb{P}(\mathcal{O}_t = 1 | s_t, a_t) = \exp(r(s_t, a_t)) \; .$$
My question is: why do they do this? The paper makes no assumptions about the value of the rewards (e.g. bounding them to be non-positive), so in theory the rewards can take any value and the RHS can exceed $1$, which is obviously invalid for a probability. It would make sense if there were some normalising constant, or if the author said the probability is merely proportional to this, but they don't.
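To give a concrete instance of my concern, suppose some state-action pair has reward $r(s_t, a_t) = 2$. Then, under the definition above,
$$\mathbb{P}(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp(2) \approx 7.39 > 1 \; ,$$
which cannot be a probability.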
I have searched online and nobody seems to have asked this question, which makes me feel like I am missing something quite obvious, so I would appreciate it if somebody could clear this up for me.