
In mathematics, the word operator can refer to several distinct but related concepts. An operator can be defined as a function between two vector spaces, as a function whose domain and codomain are the same, or as a function that maps functions (which are vectors) to other functions (for example, the differential operator), that is, a higher-order function (if you are familiar with functional programming).
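As a concrete illustration of that last sense, here is a minimal Python sketch (the `derivative` helper is just a made-up example, not from any particular library): differentiation takes a function and returns another function.

```python
def derivative(f, h=1e-6):
    # Takes a function f and returns a new function approximating its derivative
    # via a central finite difference.
    return lambda x: (f(x + h) - f(x - h)) / (2 * h)

square = lambda x: x ** 2
d_square = derivative(square)  # the operator's output is itself a function
print(d_square(3.0))           # approximately 6.0
```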

  • What is the Bellman operator in reinforcement learning (RL)?
  • Why do we even need it?
  • How is the Bellman operator related to the Bellman equations in RL?
  • A few papers related to this topic are _[Feature-Based Methods for Large Scale Dynamic Programming](http://www.mit.edu/people/jnt/Papers/J060-96-bvr-feature.pdf)_ (by John N. Tsitsiklis and Benjamin Van Roy, 1996), _[An Analysis of Temporal-Difference Learning with Function Approximation](http://www.mit.edu/~jnt/Papers/J063-97-bvr-td.pdf)_ (by John N. Tsitsiklis and Benjamin Van Roy, 1997) and _[Least-Squares Policy Iteration](http://www.jmlr.org/papers/volume4/lagoudakis03a/lagoudakis03a.pdf)_ (by Michail G. Lagoudakis and Ronald Parr, 2003). – nbro Mar 10 '19 at 22:59
  • A few more related papers that I found are [Generalized Markov Decision Processes: Dynamic-programming and Reinforcement-learning Algorithms](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71.1316&rep=rep1&type=pdf) (by Csaba Szepesvári and Michael L. Littman, 1997) and [$\epsilon$-MDPs: Learning in Varying Environments](http://www.jmlr.org/papers/volume3/szita02a/szita02a.pdf) (by István Szita, Bálint Takács, András Lörincz, 2002). – nbro Mar 14 '19 at 19:34

1 Answer


The notation I'll be using is from two different lectures by David Silver and is also informed by these slides.

The expected Bellman equation is $$v_\pi(s) = \sum_{a\in \cal{A}} \pi(a|s) \left(\cal{R}_s^a + \gamma\sum_{s' \in \cal{S}} \cal{P}_{ss'}^a v_\pi(s')\right) \tag 1$$

If we let $$\cal{P}_{ss'}^\pi = \sum\limits_{a \in \cal{A}} \pi(a|s)\cal{P}_{ss'}^a \tag 2$$ and $$\cal{R}_{s}^\pi = \sum\limits_{a \in \cal{A}} \pi(a|s)\cal{R}_{s}^a \tag 3$$ then we can rewrite $(1)$ as

$$v_\pi(s) = \cal{R}_s^\pi + \gamma\sum_{s' \in \cal{S}} \cal{P}_{ss'}^\pi v_\pi(s') \tag 4$$

This can be written in matrix form

$$\left. \begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix}= \begin{bmatrix} \cal{R}_1^\pi \\ \vdots \\ \cal{R}_n^\pi \end{bmatrix} +\gamma \begin{bmatrix} \cal{P}_{11}^\pi & \dots & \cal{P}_{1n}^\pi\\ \vdots & \ddots & \vdots\\ \cal{P}_{n1}^\pi & \dots & \cal{P}_{nn}^\pi \end{bmatrix} \begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix} \right. \tag 5$$

Or, more compactly,

$$v_\pi = \cal{R}^\pi + \gamma \cal{P}^\pi v_\pi \tag 6$$
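Because $(6)$ is linear in $v_\pi$, it can be solved directly as $v_\pi = (I - \gamma \cal{P}^\pi)^{-1}\cal{R}^\pi$ when the state space is small. A minimal NumPy sketch, assuming a made-up 2-state MDP with hypothetical values for $\cal{R}^\pi$ and $\cal{P}^\pi$:

```python
import numpy as np

# Hypothetical 2-state MDP, already reduced to R^pi and P^pi as in (2)-(3).
gamma = 0.9
R_pi = np.array([1.0, 0.0])          # R^pi: expected immediate reward per state
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])        # P^pi: state-transition matrix under pi

# Rearranging (6): (I - gamma * P^pi) v_pi = R^pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(v_pi)
```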

Notice that both sides of $(6)$ are $n$-dimensional vectors. Here $n=|\cal{S}|$ is the size of the state space. We can then define an operator $\cal{T}^\pi:\mathbb{R}^n\to\mathbb{R}^n$ as

$$\cal{T^\pi}(v) = \cal{R}^\pi + \gamma \cal{P}^\pi v \tag 7$$

for any $v\in \mathbb{R}^n$. This is the expected Bellman operator.
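In code, $(7)$ is essentially a one-liner. A minimal sketch (the array shapes are my own assumptions, matching the matrix form above):

```python
import numpy as np

def bellman_expectation_operator(v, R_pi, P_pi, gamma):
    """T^pi(v) = R^pi + gamma * P^pi v, as in equation (7).

    v and R_pi have shape (n,); P_pi has shape (n, n).
    """
    return R_pi + gamma * P_pi @ v
```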

Similarly, you can rewrite the Bellman optimality equation

$$v_*(s) = \max_{a\in\cal{A}} \left(\cal{R}_s^a + \gamma\sum_{s' \in \cal{S}} \cal{P}_{ss'}^a v_*(s')\right) \tag 8$$

to define the Bellman optimality operator

$$\cal{T^*}(v) = \max_{a\in\cal{A}} \left(\cal{R}^a + \gamma \cal{P}^a v\right) \tag 9$$
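As with $(7)$, equation $(9)$ can be sketched directly in NumPy. Here the max is taken componentwise (state by state), and the per-action arrays `R` and `P` are hypothetical stand-ins for $\cal{R}^a$ and $\cal{P}^a$:

```python
import numpy as np

def bellman_optimality_operator(v, R, P, gamma):
    """T*(v) as in equation (9), with the max taken state by state.

    R has shape (|A|, n): R[a] is the reward vector R^a.
    P has shape (|A|, n, n): P[a] is the transition matrix P^a.
    """
    q = R + gamma * P @ v     # shape (|A|, n): backed-up value of each (a, s) pair
    return q.max(axis=0)      # componentwise max over actions
```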

The Bellman operators are "operators" in that they are mappings from one point to another within the vector space of state values, $\mathbb{R}^n$.

Rewriting the Bellman equations as operators is useful for proving that certain dynamic programming algorithms (e.g. policy iteration, value iteration) converge to a unique fixed point. This usefulness comes in the form of a body of existing work in operator theory, which allows us to make use of special properties of the Bellman operators.

Specifically, the fact that the Bellman operators are contractions gives the useful results that, for any policy $\pi$ and any initial vector $v$,

$$\lim_{k\to\infty}(\cal{T}^\pi)^k v = v_\pi \tag{10}$$

$$\lim_{k\to\infty}(\cal{T}^*)^k v = v_* \tag{11}$$

where $v_\pi$ is the value function of policy $\pi$ and $v_*$ is the value function of an optimal policy $\pi^*$. The proof follows from the contraction mapping theorem (the Banach fixed-point theorem).
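To see $(10)$ concretely, you can iterate $\cal{T}^\pi$ from an arbitrary starting vector and compare against the direct solution of $(6)$. A minimal sketch, reusing the same hypothetical 2-state MDP as in the earlier snippet:

```python
import numpy as np

# Same hypothetical 2-state MDP as before.
gamma = 0.9
R_pi = np.array([1.0, 0.0])
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])

v = np.zeros(2)                     # any starting vector works, per (10)
for _ in range(1000):
    v = R_pi + gamma * P_pi @ v     # v <- T^pi(v)

v_direct = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(np.allclose(v, v_direct))     # True: the iterates converge to the fixed point v_pi
```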
