
After spending some time reading about POMDPs, I'm still having a hard time understanding how grid-based solutions work.

I understand the finite-horizon brute-force solution, where you have your current belief distribution, enumerate every possible sequence of action/observation pairs up to a given depth, and compute the expected reward.
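To make what I mean concrete, here is roughly the computation I have in mind (just my own sketch, with placeholder names `T`, `Z`, `R` for the transition, observation and reward models, not code from any particular library):

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Bayes update of belief b after taking action a and observing o.
    T[a] is |S|x|S| with T[a][s, s'] = P(s'|s,a); Z[a] is |S|x|O| with Z[a][s', o] = P(o|s',a)."""
    b_pred = T[a].T @ b                 # predicted next-state distribution
    b_new = Z[a][:, o] * b_pred         # weight by observation likelihood
    norm = b_new.sum()                  # = P(o | b, a)
    return (b_new / norm if norm > 0 else b_pred), norm

def brute_force_value(b, depth, T, Z, R, gamma=0.95):
    """Expected discounted return of the best action from belief b, enumerating
    the full action/observation tree down to the given depth."""
    if depth == 0:
        return 0.0
    n_actions, n_obs = len(T), Z[0].shape[1]
    best = -np.inf
    for a in range(n_actions):
        value = float(b @ R[:, a])      # immediate expected reward
        for o in range(n_obs):
            b_next, p_o = belief_update(b, a, o, T, Z)
            if p_o > 0:
                value += gamma * p_o * brute_force_value(b_next, depth - 1, T, Z, R, gamma)
        best = max(best, value)
    return best
```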

I have tried to read some sources about grid-based approximations, for example, these slides describe the grid-based approach.

However, it's not clear to me what exactly is going on. I'm not understanding how the value function is actually computed. After you take an action, how do you update your belief states to be consistent with the grid? Does the grid-based solution simply reduce the set of belief states? How does this reduce the complexity of the problem?

I'm not seeing how this reduces the number of action/observation combinations that need to be considered for a finite-horizon solution.


1 Answer


I will attempt to provide an answer to your questions based on the information you can find in the papers A Heuristic Variable Grid Solution Method for POMDPs (1997) by Ronen I. Brafman and Point-based value iteration: An anytime algorithm for POMDPs (2003) by Joelle Pineau et al.

A grid-based approximate solution to a POMDP attempts to estimate the value function only at a finite subset of the belief states. Why? Because estimating the value function for all belief states is typically computationally infeasible for all but the smallest POMDPs, given that the belief-space MDP (i.e. an MDP whose state space consists of probability distributions over the original states of the POMDP) of a POMDP with $n$ states has an uncountably infinite state space. Why? Because the belief states are points of the continuous $(n-1)$-dimensional probability simplex.
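To make this concrete, here is a minimal sketch of one possible grid construction: a regular grid over the belief simplex whose points have entries that are multiples of $1/m$ for some resolution $m$. This is only an illustration of the idea, not the specific (e.g. variable-resolution) grid used in the cited papers.

```python
from itertools import product
import numpy as np

def belief_grid(n_states, resolution):
    """All belief vectors whose entries are multiples of 1/resolution and sum to 1."""
    points = []
    for counts in product(range(resolution + 1), repeat=n_states):
        if sum(counts) == resolution:
            points.append(np.array(counts) / resolution)
    return points

grid = belief_grid(n_states=3, resolution=4)   # 15 grid points instead of the whole 2-simplex
```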

How do we compute the value of the belief states that do not correspond to a point of the grid? We can use e.g. interpolation, i.e. the value of a belief state that does not correspond to a grid point is computed as a function of the values of the belief states that do correspond to grid points (typically, the neighboring grid points).
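As an illustration of the idea (and not the exact interpolation rule used in the cited papers), a simple distance-weighted scheme over the nearest grid points could look like this:

```python
import numpy as np

def interpolate_value(b, grid_points, grid_values, k=3):
    """Estimate V(b) for an off-grid belief b as a distance-weighted average
    of the values at the k nearest grid points (a simple illustrative scheme)."""
    dists = np.array([np.linalg.norm(b - g, ord=1) for g in grid_points])
    nearest = np.argsort(dists)[:k]
    if np.isclose(dists[nearest[0]], 0.0):
        return grid_values[nearest[0]]          # b is (numerically) a grid point
    weights = 1.0 / dists[nearest]
    return float(weights @ np.array(grid_values)[nearest] / weights.sum())
```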

Why is this approach feasible? The assumption is that interpolation is much cheaper than computing the value of a belief state from scratch. However, note that you may not need to interpolate at every step of your algorithm, i.e. interpolation could be performed lazily, only when the value of a certain belief state is actually required.

How do you compute the value of a belief state that corresponds to a grid point? It can be computed with a value iteration (dynamic programming) algorithm for POMDPs. An overview of a value iteration algorithm can be found in section 2 of the paper Point-based value iteration: An anytime algorithm for POMDPs. Here's an example of the application of the value iteration algorithm for POMDPs.
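To tie the pieces together, here is a rough sketch of what one version of value iteration restricted to a grid could look like: each grid point is backed up using a Bayes belief update, and the value of any successor belief that falls off the grid is obtained with the `interpolate_value` helper sketched above. The names `T`, `Z`, `R` are placeholder transition/observation/reward models of my own; this is only meant to convey the general mechanism, not the exact algorithm of the cited papers.

```python
import numpy as np

def bayes_update(b, a, o, T, Z):
    """Bayes update of belief b after action a and observation o.
    T[a] is |S|x|S| with T[a][s, s'] = P(s'|s,a); Z[a] is |S|x|O| with Z[a][s', o] = P(o|s',a)."""
    b_pred = T[a].T @ b                     # predicted next-state distribution
    b_new = Z[a][:, o] * b_pred             # weight by observation likelihood
    norm = b_new.sum()                      # = P(o | b, a)
    return (b_new / norm if norm > 0 else b_pred), norm

def grid_value_iteration(grid, T, Z, R, gamma=0.95, sweeps=50):
    """Approximate value iteration over a fixed grid of belief points.
    Off-grid successor beliefs are evaluated with interpolate_value (sketched above)."""
    values = [0.0] * len(grid)
    n_actions, n_obs = len(T), Z[0].shape[1]
    for _ in range(sweeps):
        new_values = []
        for b in grid:
            best = -np.inf
            for a in range(n_actions):
                q = float(b @ R[:, a])      # immediate expected reward
                for o in range(n_obs):
                    b_next, p_o = bayes_update(b, a, o, T, Z)
                    if p_o > 0:             # weight by P(o | b, a)
                        q += gamma * p_o * interpolate_value(b_next, grid, values)
                best = max(best, q)
            new_values.append(best)
        values = new_values
    return values
```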

The grid-based approach, introduced in Computationally Feasible Bounds for Partially Observed Markov Decision Processes (1991) by William S. Lovejoy, is very similar to the point-based approach, which was introduced in Point-based value iteration: An anytime algorithm for POMDPs. The main differences between the two approaches can be found in section 3 of Point-based value iteration: An anytime algorithm for POMDPs.

The idea of discretizing your problem or simply computing the desired value at a subset of the domain has been applied in other contexts too. For example, in the context of computer vision, you can approximate the derivative (or gradient) of an image (which is thus considered a function) at discrete points of the domain (i.e. the pixels).

There's a Julia implementation of the first grid-based approximate solution to POMDPs. There's also a Python implementation of the point-based approach. These implementations may help you to understand the details of these approaches.

  • Thanks, I am still confused on this approximation method though. Actions have associated probability distributions that produce belief states which may not be on the grid, basically your state can be transformed to something outside your state space, so how do you handle this? – FourierFlux Apr 05 '20 at 04:51
  • 1
    @FourierFlux I don't understand what you mean by "outside of the state space" in "basically your state can be transformed to something outside your state space". Anyway, I've updated this answer to include more details and references. – nbro Apr 05 '20 at 13:51
  • Thanks, I will read them. I think my confusion is that the belief state is continuous and an action can take a point in the belief space grid and update it to no longer be in the grid. So it's not clear to me how this is addressed when it comes to computations. Even restricting your actions to a smaller subset won't mean they will always result in transforming a grid point to a grid point. When transforming the POMDP into a discretized MDP, what defines your set of actions and associated probabilities? – FourierFlux Apr 05 '20 at 20:04
  • When you say "an action can take a point in the belief space grid and update it to no longer be in the grid". Well, it's true that an action may move the environment to a state that is not represented in the grid. That's why you may need to use something like _interpolation_. Are you familiar with the concept of interpolation? – nbro Apr 05 '20 at 20:14
  • Let's consider an MDP with a continuous state space; discretizing it still leaves the problem of your actions giving states not in the discretization. Where exactly does the interpolation come in when it comes to finding a policy, and how does it help? Do you define new actions and round the resulting state to the "closest" one? Doesn't interpolation just lead back to the same issue of having a continuous state space? – FourierFlux Apr 05 '20 at 22:47
  • @FourierFlux I think your first question "Let's consider an MDP with a continuous state space; discretizing it still leaves the problem of your actions giving states not in the discretization" is related to the discretization of continuous space MDPs and how it works. Maybe you could ask a new question. You're very confused. First of all, do you understand that if you find the value function you can derive the policy from it? So, if that's the case, we just find a value function in order to solve an MDP. How? There is a value iteration algorithm for MDPs and there's also one for POMDPs. – nbro Apr 05 '20 at 23:21
  • @FourierFlux Then you say "Do you define new actions", but I don't understand why you would define new actions. Then you say "Doesn't interpolation just lead back to the same issue of having a continuous state space?", interpolation is an operation that allows you to find the values of a function at points where you don't have those values. For instance, if you have $f(a)$ and $f(b)$ and you want to find the value of $f$ at $a \leq c \leq b$, then you can SOMEHOW combine $f(a)$ and $f(b)$ to get $f(c)$. – nbro Apr 05 '20 at 23:23
  • In our context, we have the value of the belief states at certain grid points (we compute this with e.g. _one value iteration algorithm for POMDPs_). If we wanted to find the value of the belief states not at the grid points, we would compute them SOMEHOW by using the value of the belief states at the grid points (which we have!). You don't introduce any new action or whatever. To be more concrete, assume that we have the value at the belief states $b_1, b_3, b_5$ (which correspond to some grid points), i.e. we have $v(b_1)$, $v(b_3)$ and $v(b_5)$ (computed with value iteration). – nbro Apr 05 '20 at 23:27
  • Now, if we wanted to find $v(b_2)$, we could SOMEHOW combine $v(b_1)$ and $v(b_3)$, i.e. $v(b_2) = g(v(b_1), v(b_3))$, where $g$ is some function that combines the inputs. Note that I don't know if this is actually done in practice. I am just giving you the GENERAL idea behind interpolation!! – nbro Apr 05 '20 at 23:30