5

Most RL books (Sutton & Barto, Bertsekas, etc.) discuss policy iteration for infinite-horizon MDPs. Does the policy iteration convergence result also hold for finite-horizon MDPs? If yes, how can we derive the algorithm?

nbro
user529295

1 Answer

3

In the discussion about Neil Slater's answer (which he, sadly, deleted) it was pointed out that the policy $\pi$ should also depend on the horizon $h$: the choice of action $a$ can be influenced by how many steps are left. So the "policy" in that case is actually a collection of policies $\pi_h(a|s)$ indexed by $h$, the distance to the horizon.
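To make that dependence concrete, here is a minimal backward-recursion sketch (my own illustration, not part of the lifting construction below), assuming a tabular MDP stored as numpy arrays `P[a, s, s']` and `R[a, s, s']`; the function name is made up:

```python
import numpy as np

def horizon_indexed_greedy_policy(P, R, h):
    """Illustration: the optimal finite-horizon policy is a collection pi_n(a|s),
    one deterministic policy per distance-to-horizon n.

    P[a, s, s'] : transition probabilities, R[a, s, s'] : rewards, h : horizon.
    Returns pi of shape (h + 1, S); pi[n, s] is the greedy action with n steps left.
    """
    A, S, _ = P.shape
    V = np.zeros(S)                        # value with 0 steps left
    pi = np.zeros((h + 1, S), dtype=int)   # pi[0] is arbitrary: nothing left to decide
    for n in range(1, h + 1):
        # Q(a, s) with n steps left, built from the (n-1)-step values
        Q = np.einsum('asj,asj->as', P, R + V[None, None, :])
        pi[n] = Q.argmax(axis=0)
        V = Q.max(axis=0)
    return pi
```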

Alternatively, one can look at it as if our state space $\mathcal{S}$ is now extended with that integer. So we can "lift" our original finite-horizon MDP $(\mathcal{S},\mathcal{A},P,R)$ into a new infinite-horizon MDP by substituting:

$$\mathcal{S} \to \left(\mathcal{S}\times\{0,1,\dots,h\}\right) \cup \{\epsilon\}, \qquad \mathcal{A} \to \mathcal{A},\quad R \to \tilde R,\quad P \to \tilde P$$

The new reward and transition functions make sure that the horizon index keeps decreasing and that we end up in an absorbing state $\epsilon$ that has no future effects (here $n = 1,\dots,h$ and $s_n$ denotes the lifted state $(s, n)$):

$$\tilde P(s'_{n-1}|s_n,a) = P(s'|s,a),\qquad \tilde P(\epsilon|s_0,a) = \tilde P(\epsilon|\epsilon,a) = 1$$
$$\tilde R(s_n,a,s'_{n-1}) = R(s,a,s'),\qquad \tilde R(s_0,a,\epsilon) = \tilde R(\epsilon,a,\epsilon) = 0$$

This way I've reduced the finite-horizon MDP to an infinite-horizon one, so I can reuse the policy iteration convergence result for infinite-horizon MDPs.
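A minimal sketch of this lifting, assuming the same tabular representation as above (arrays `P[a, s, s']` and `R[a, s, s']`); the function name and the index layout are my own choices:

```python
import numpy as np

def lift_finite_horizon_mdp(P, R, h):
    """Build the lifted infinite-horizon MDP described above.

    Lifted state (s, n), with n steps to the horizon, is flattened to s * (h + 1) + n;
    the absorbing state epsilon gets the last index.
    """
    A, S, _ = P.shape
    S_lift = S * (h + 1) + 1            # extended states plus epsilon
    eps = S_lift - 1

    P_lift = np.zeros((A, S_lift, S_lift))
    R_lift = np.zeros((A, S_lift, S_lift))

    for a in range(A):
        for s in range(S):
            for n in range(1, h + 1):           # n >= 1: behave like the original MDP,
                for s_next in range(S):         # decrementing the horizon index
                    i = s * (h + 1) + n
                    j = s_next * (h + 1) + (n - 1)
                    P_lift[a, i, j] = P[a, s, s_next]
                    R_lift[a, i, j] = R[a, s, s_next]
            P_lift[a, s * (h + 1), eps] = 1.0   # n = 0: fall into epsilon, zero reward
        P_lift[a, eps, eps] = 1.0               # epsilon is absorbing, zero reward

    return P_lift, R_lift
```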

Couple of notes:

  • At first this feels like a huge increase in the state space, making the whole problem unnecessarily complex. But this complexity is intrinsic to the problem: both the policy and the value function depend on the distance to the horizon, so it is necessary to treat the extended set of unknowns in a single, self-consistent manner.
  • The infinite-horizon policy iteration convergence result relies on a discount factor $\gamma < 1$, while the finite-horizon case doesn't need $\gamma$ for convergence. That's where I feel like I've cheated a bit (see the sketch after these notes).
  • I came up with this approach myself. It feels quite obvious though. I'd expect this approach to either be mistaken or be already mentioned somewhere in the literature - comments pointing out one or the other are welcome.
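With the lifted MDP in hand, standard infinite-horizon policy iteration can be run on it as-is. Below is a minimal sketch (my own code, reusing the arrays produced by the snippet above; `policy_iteration_lifted` is a made-up name). Note that, by construction, every policy reaches $\epsilon$ within $h+1$ steps, so exact policy evaluation is possible even with $\gamma = 1$, which is one way around the discounting reservation in the second note:

```python
import numpy as np

def policy_iteration_lifted(P_lift, R_lift, gamma=1.0):
    """Standard policy iteration on the lifted MDP.

    Every policy reaches epsilon within h + 1 steps, so the policy-evaluation
    linear system is solvable even at gamma = 1: fix V(epsilon) = 0 and solve
    over the remaining (transient) states.
    """
    A, S, _ = P_lift.shape
    eps = S - 1
    trans = np.arange(S - 1)              # every lifted state except epsilon
    policy = np.zeros(S, dtype=int)       # arbitrary initial policy

    while True:
        # Policy evaluation: V = r_pi + gamma * P_pi V, with V(epsilon) = 0.
        P_rows = P_lift[policy[trans], trans]     # rows of P under the current policy
        R_rows = R_lift[policy[trans], trans]
        r_pi = (P_rows * R_rows).sum(axis=1)      # expected one-step reward
        V = np.zeros(S)
        V[trans] = np.linalg.solve(np.eye(S - 1) - gamma * P_rows[:, trans], r_pi)

        # Policy improvement: greedy one-step lookahead.
        Q = np.einsum('asj,asj->as', P_lift, R_lift + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=0)
        new_policy[eps] = policy[eps]             # the action at epsilon is irrelevant
        if np.array_equal(new_policy, policy):
            return policy, V                      # policy is greedy w.r.t. its own value
        policy = new_policy
```

The returned policy at index $s(h+1)+n$ is then exactly $\pi_n(s)$: the action to take in the original state $s$ with $n$ steps to the horizon.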
Kostya
  • Intuitively, I understand that the policy should depend on the horizon $h$. However, why do we have to "extend" our state space? Also, why did @neilslater delete his answer? I didn't get to see it. – user529295 Apr 16 '21 at 18:38
  • @user529295 The idea is that, instead of having multiple policies, you have multiple versions of each state. And your transition probabilities always decrease the numerical index associated with each state. – Kostya Apr 16 '21 at 18:44
  • Regarding your reservation with respect to discount factor $\gamma$, my view is that finite horizon MDP is similar to stochastic shortest path with fixed horizon rather than random. In that way, we don't have to rely on discounted MDP formulation (or discount factor, in general). However, I ain't sure if my view is correct. – user529295 Apr 16 '21 at 18:45
  • @user529295 I think you are exactly right - the new "lifted" MDP can be seen as a SSP with the target state $\epsilon$ – Kostya Apr 16 '21 at 19:35
  • @user529295 I deleted my answer because I was concerned that it may be wrong and misleading. In it I give an algorithm that works for one interpretation of "optimal" for an MDP with horizon held at $t + h$ for all value functions. However, I am concerned it is the wrong interpretation of "optimal", and I did not want to confuse things – Neil Slater Apr 16 '21 at 20:48
  • I don't think your answer is correct. One of the primary requirements for convergence in FHP is that the policy must be proper. I don't think there is such a requirement in IHP, though I think it is a mathematical technicality (I can be wrong too). –  Apr 19 '21 at 13:14
  • @DuttaA I'm not totally sure, but if "proper" means - "reaches the goal state with P=1". Then any policy over "lifted" MDP is "proper", because the transition function is structured that way. – Kostya Apr 19 '21 at 13:26
  • If the policy is not proper, FHP becomes IHP. I don't know how the transition function is useful here. –  Apr 19 '21 at 13:32
  • @DuttaA The dynamics (transition function) is such that you always reach the goal state - no matter what your policy is. – Kostya Apr 19 '21 at 13:47
  • That's a very strong assumption. –  Apr 19 '21 at 13:50
  • @DuttaA That's not an assumption - that's how I've constructed the "lifted" MDP. Every state has an index $s_i$ that gets decremented at every transition. When the index reaches 0 - we transition to the goal state. – Kostya Apr 19 '21 at 14:03
  • @DuttaA You're incorrect about your previous comment "If the policy is not proper FHP becomes IHP". You only need proper policies for SSP problems, which are one version of infinite-horizon problems. You can do away with proper policies if your cost is bounded, thereby leading to discounted MDPs. Similarly, for finite horizon problems, you only need non-stationary policies, not proper policies. – user529295 Apr 23 '21 at 05:47
  • @user529295 'your cost is bounded' How do you plan on bounding the cost in SSP without a proper policy? By FHP I meant SSP here, since that's the theme being followed in the answer (it seemed to me at least). Also, for FHP (by the strict definition you are using), costs will always be bounded. And lastly, SSP does become IHP if the policy is not proper. SSP proofs do not hold for IHP for a reason, and that is proper policies. Also, SSP costs will be bounded by default if policies are proper, and that is why we need it (and another small assumption). –  Apr 23 '21 at 06:19
  • @user529295 I see now, it seems I have mistaken the question to be asking about SSP as well. Did you strictly mean FHP only? (SSP and FHP are used interchangeably in Sutton and Barto as far as I know.) For FHP, I believe the convergence proof would be even easier. –  Apr 23 '21 at 06:32
  • @DuttaA SSP was never mentioned, neither in the question nor in the answer. Both the question and the answer repeatedly state that they deal with MDPs. Just saying. – Kostya Apr 23 '21 at 08:16
  • In Sutton and Barto they use it interchangeably (if I recall correctly). In fact, there FHP problems are actually SSP problems. Gridworlds and gambling are all formulated/solved as SSP problems (if I recall correctly). So I assumed the same. I am curious about what you mean by MDP? An SSP can't be an MDP? –  Apr 23 '21 at 08:28
  • @DuttaA There is not a single mention of SSPs in Sutton and Barto. They define MDP in the 3rd chapter and use MDPs throughout the book. – Kostya Apr 23 '21 at 08:37
  • Not mentioning it doesn't mean it has not been used. FHP and SSP are pretty distinct problems. For example, multiple chess games between two players for $N$ fixed rounds is a very nice example of FHP (the agent decides whether to play risky or safe in a particular chess game of the $N$ chess games), while making $N$ a random variable gives an SSP. Gridworlds and the Gambler's problem should fall under this categorisation (I don't exactly recall how they were formulated though). –  Apr 23 '21 at 08:49
  • Yes, MDPs and SSPs are distinct. In this Q/A we've been talking about MDPs. The Sutton and Barto book is about MDPs. You can reformulate some MDPs as SSPs if you want. This Q/A doesn't discuss this. Sutton and Barto don't either. – Kostya Apr 23 '21 at 09:03