In infinite-horizon MDPs one can consider the expected discounted return from a distribution of start states as the objective[^1], i.e. $\mathbb{E}[V^{\pi}(S_0)] = \mathbb{E}[G_0] = \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t R_{t+1}\right]$, where the expectation is over start states $S_0$, subsequent states drawn from the transition dynamics of the MDP, and actions drawn from the policy $\pi$.
In this setting, discounting is interpreted as either (1) giving less importance to future rewards, or (2) having a $1-\gamma$ probability of the episode being truncated forever at every step, as mentioned in this answer. The answer claims that the *or* is exclusive and that "choosing one or the other means tackling a different problem": (1) is implemented by weighting rewards by $\gamma^t$ inside the return, while (2) is implemented by actually simulating the truncation.
I believe this is not true: it does not make sense to include $\gamma$ in the return (and thus, implicitly, in the state and action value functions) without also interpreting it as a probability of truncation, i.e. as part of the MDP (interpretation (2) in the answer).
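To make this concrete, here is a minimal Monte Carlo sketch (my own toy example: a made-up two-state chain under a fixed policy, nothing from the linked answer) showing that the discounted return of interpretation (1) and the undiscounted return under a $1-\gamma$ per-step truncation probability of interpretation (2) agree in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[0.7, 0.3],   # P[s, s']: transition probabilities under the fixed policy
              [0.4, 0.6]])
R = np.array([1.0, 0.0])    # reward received on entering each state

def rollout_discounted(s, horizon=100):
    """Interpretation (1): weight rewards by gamma^t inside the return."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):           # horizon long enough that gamma^horizon is negligible
        s = rng.choice(2, p=P[s])
        g += discount * R[s]
        discount *= gamma
    return g

def rollout_truncated(s):
    """Interpretation (2): undiscounted return, but the episode is truncated
    with probability 1 - gamma after each reward."""
    g = 0.0
    while True:
        s = rng.choice(2, p=P[s])
        g += R[s]
        if rng.random() > gamma:
            return g

n = 10_000  # increase for a tighter match; both means are Monte Carlo estimates
print(np.mean([rollout_discounted(0) for _ in range(n)]))
print(np.mean([rollout_truncated(0) for _ in range(n)]))
```

The two printed estimates are approximately equal, since reward $R_{t+1}$ is weighted by $\gamma^t$ in the first case and received with probability $\gamma^t$ in the second.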
In the question the answer comes from, if one derives the policy gradient with the discounted return as the objective, then the discounted state distribution arises (the one corresponding to interpretation (2)), not the original one. Proof here. Hence $\hat{\nabla J_1} = G_t \nabla\log\pi(A_t \mid S_t)$, where $G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$, is a biased estimator of the policy gradient under the "giving less importance to future rewards" reading; $\hat{\nabla J_2} = \gamma^t G_t \nabla\log\pi(A_t \mid S_t)$ would be the correct one.
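For concreteness, a short sketch of the two estimators (tabular softmax policy and a made-up trajectory; none of these names come from the linked question), whose only difference is the $\gamma^t$ weight on each term:

```python
import numpy as np

gamma = 0.9

def log_pi_grad(theta, s, a):
    """Gradient of log pi(a|s) w.r.t. theta for a tabular softmax policy pi(.|s) = softmax(theta[s])."""
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    grad = np.zeros_like(theta)
    grad[s] = -probs
    grad[s, a] += 1.0
    return grad

def returns(rewards, gamma):
    """G_t = sum_k gamma^k R_{t+k+1} for every t along the trajectory."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def grad_estimates(theta, states, actions, rewards):
    G = returns(rewards, gamma)
    grad_J1 = np.zeros_like(theta)  # sum_t G_t * grad log pi(A_t|S_t)          (biased for J)
    grad_J2 = np.zeros_like(theta)  # sum_t gamma^t G_t * grad log pi(A_t|S_t)  (unbiased for J)
    for t, (s, a) in enumerate(zip(states, actions)):
        g = G[t] * log_pi_grad(theta, s, a)
        grad_J1 += g
        grad_J2 += gamma**t * g
    return grad_J1, grad_J2

# Toy trajectory from a 2-state, 2-action problem (numbers are made up).
theta = np.zeros((2, 2))
states, actions, rewards = [0, 1, 1, 0], [1, 0, 1, 0], [0.0, 1.0, 0.0, 1.0]
g1, g2 = grad_estimates(theta, states, actions, rewards)
print("J1 estimator:\n", g1)
print("J2 estimator:\n", g2)
```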
It is as if one implicitly changes the MDP by discounting (or, equivalently, as if $\gamma$ were part of the definition of the MDP and not only of the objective).
Does this make sense?
[^1]: Another objective is the average reward.