Questions tagged [bellman-equations]

For questions related to the Bellman equations in the context of reinforcement learning (and other artificial intelligence subfields).

60 questions
27
votes
1 answer
What is the Bellman operator in reinforcement learning?
In mathematics, the word operator can refer to several distinct but related concepts. An operator can be defined as a function between two vector spaces, as a function whose domain and codomain are the same, or it can be…

nbro
- 39,006
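For context, one standard definition (following Sutton & Barto's notation; conventions vary): for a fixed policy $\pi$, the Bellman expectation operator $T^{\pi}$ maps a value function $V$ to
$$(T^{\pi} V)(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V(s') \right],$$
while the Bellman optimality operator $T$ replaces the policy average with a maximum:
$$(T V)(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V(s') \right].$$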
9
votes
2 answers
Why are the Bellman operators contractions?
In these slides, it is written
\begin{align}
\left\|T^{\pi} V-T^{\pi} U\right\|_{\infty} & \leq \gamma\|V-U\|_{\infty} \tag{9} \label{9} \\
\|T V-T U\|_{\infty} & \leq \gamma\|V-U\|_{\infty} \tag{10} \label{10}
\end{align}
where
$F$ is the space of…

kevin
- 191
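The standard one-line argument behind inequality (9), sketched here for context: for any state $s$,
\begin{align}
\left| (T^{\pi} V)(s) - (T^{\pi} U)(s) \right| &= \gamma \left| \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \left[ V(s') - U(s') \right] \right| \\
&\leq \gamma \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \left\| V - U \right\|_{\infty} = \gamma \left\| V - U \right\|_{\infty},
\end{align}
and taking the supremum over $s$ gives the contraction. Inequality (10) follows similarly, using the fact that the max operator is a non-expansion.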
8
votes
1 answer
How is the DQN loss derived from (or theoretically motivated by) the Bellman equation, and how is it related to the Q-learning update?
I'm doing a project on Reinforcement Learning. I programmed an agent that uses DDQN. There are a lot of tutorials on that, so the code implementation was not that hard.
However, I have problems understanding how one should come up with this kind of…

Yves Boutellier
- 183
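A minimal sketch of how the Bellman equation typically becomes the DQN loss, in PyTorch-style Python (the network names and batch layout here are hypothetical, not taken from the question):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared Bellman error between Q(s, a) and the bootstrapped
    target r + gamma * max_a' Q_target(s', a')."""
    states, actions, rewards, next_states, dones = batch

    # Q-values of the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped Bellman target; no gradient flows through the target network
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next_q

    return F.mse_loss(q_sa, target)
```

In the tabular case, one step of stochastic gradient descent on this loss recovers the familiar Q-learning update.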
8
votes
2 answers
What is the proof that policy evaluation converges to the optimal solution?
Although I know how the algorithm of iterative policy evaluation using dynamic programming works, I am having a hard time seeing why it actually converges.
It is intuitive that, with each iteration, we get a better and better…

SAGALPREET SINGH
- 147
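For reference, a minimal NumPy sketch of iterative policy evaluation (the array layout is an assumption for illustration): convergence follows because each sweep applies the Bellman expectation operator, which is a $\gamma$-contraction in the sup norm, so the iterates approach its unique fixed point $v_{\pi}$.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """P[s, a, s'] = transition probabilities, R[s, a] = expected rewards,
    pi[s, a] = policy probabilities (hypothetical layout)."""
    v = np.zeros(P.shape[0])
    while True:
        # One application of the Bellman expectation operator T^pi
        q = R + gamma * P @ v          # q[s, a]
        v_new = (pi * q).sum(axis=1)   # average over actions under pi
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```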
7
votes
1 answer
Why do Bellman equations indirectly create a policy?
I was watching a lecture on policy gradients and Bellman equations. The lecturer says that a Bellman equation indirectly creates a policy, while the policy gradient directly learns a policy. Why is this?

echo
- 673
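One way to see the distinction: value-based methods never parameterize a policy at all; the policy is implicit, recovered by acting greedily with respect to the learned values. A sketch with a hypothetical tabular $Q$:

```python
import numpy as np

def greedy_policy(Q):
    """The Bellman equation yields values Q(s, a); the policy is only
    defined 'indirectly', as pi(s) = argmax_a Q(s, a). Policy-gradient
    methods instead parameterize and optimize pi directly."""
    return np.argmax(Q, axis=1)  # Q is a hypothetical |S| x |A| table
```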
7
votes
0 answers
Is the Bellman update that samples actions weighted by their Q-values (instead of taking the max) a contraction?
It has been proved that the Bellman update is a contraction (1).
Here is the Bellman update that is used for Q-Learning:
$$Q_{t+1}(s, a) = Q_{t}(s, a) + \alpha \left( r(s, a, s') + \gamma \max_{a^*} Q_{t}(s', a^*) - Q_t(s, a) \right) \tag{1} \label{1}$$
The proof…

sirfroggy
- 71
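My reading of the variant being asked about (an assumption, since the question is truncated) is a Boltzmann-weighted backup, where the $\max$ in equation (1) above is replaced by an average over next actions weighted by their Q-values:
$$Q_{t+1}(s, a) = Q_{t}(s, a) + \alpha \left( r(s, a, s') + \gamma \sum_{a'} \frac{e^{\beta Q_t(s', a')}}{\sum_{b} e^{\beta Q_t(s', b)}} Q_t(s', a') - Q_t(s, a) \right).$$
The question is nontrivial because, unlike the $\max$, the Boltzmann softmax operator is in general not a non-expansion, so the usual contraction proof does not go through directly.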
7
votes
2 answers
Why does the state-action value function, defined as an expected value of the reward and state value function, not need to follow a policy?
I often see that the state-action value function is expressed as:
$$q_{\pi}(s,a)=\color{red}{\mathbb{E}_{\pi}}[R_{t+1}+\gamma G_{t+1} | S_t=s, A_t = a] = \color{blue}{\mathbb{E}}[R_{t+1}+\gamma v_{\pi}(s') |S_t = s, A_t =a]$$
Why does expressing the…

Daniel Wiczew
- 323
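The missing step, spelled out: conditioning on both $S_t = s$ and $A_t = a$ already fixes the first action, so $\pi$ plays no role until time $t+1$; once $G_{t+1}$ is replaced by its conditional expectation $v_{\pi}(S_{t+1})$, all dependence on the policy is absorbed into $v_{\pi}$, and the outer expectation no longer needs the $\pi$ subscript:
$$q_{\pi}(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = a \right] = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_{\pi}(s') \right].$$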
6
votes
2 answers
What is the Bellman equation actually telling us?
What does the Bellman equation actually say? And are there many flavours of it?
I get a little confused when I look up the Bellman equation, because people seem to say slightly different things about what it is. And I think the…

Johnny
- 69
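For reference, the "flavours" usually reduce to two forms, each stated for either $v$ or $q$: the Bellman expectation equation for a fixed policy,
$$v_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_{\pi}(s') \right],$$
and the Bellman optimality equation,
$$v_{*}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_{*}(s') \right].$$
Both say the same thing: the value of a state decomposes into the immediate reward plus the discounted value of the successor state.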
5
votes
1 answer
How would I compute the optimal state-action value for a certain state and action?
I am currently trying to learn reinforcement learning and I started with the basic gridworld application. I tried Q-learning with the following parameters:
Learning rate = 0.1
Discount factor = 0.95
Exploration rate = 0.1
Default reward = 0
The…

Rim Sleimi
- 215
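For context, one tabular Q-learning step with the parameters listed above (the gridworld encoding is hypothetical): under the usual conditions, repeated updates converge to $Q^*$, from which the optimal state-action value of any pair can be read off.

```python
import numpy as np

alpha, gamma, epsilon = 0.1, 0.95, 0.1  # learning rate, discount, exploration rate

def q_update(Q, s, a, r, s_next):
    """One Q-learning step on a tabular Q with integer states/actions."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```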
4
votes
1 answer
In reinforcement learning, does the optimal value correspond to performing the best action in a given state?
I am confused about the definition of the optimal value ($V^*$) and optimal action-value ($Q^*$) in reinforcement learning, so I need some clarification, because some blogs I read on Medium and GitHub are inconsistent with the literature.
Originally, I…

Rui Nian
- 423
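The definitions most of the literature agrees on:
$$V^*(s) = \max_{\pi} V^{\pi}(s), \qquad Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a),$$
together with the relation $V^*(s) = \max_{a} Q^*(s, a)$: the optimal value of a state is achieved by taking the best action there and acting optimally afterwards.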
4
votes
2 answers
How do we get the optimal value-function?
Here it says the following (is it correct?):
$$V^\pi(s) = \sum_{a \in A}\pi(a|s) \, Q^\pi(s,a)$$
And we have:
$$V^*(s) = \max_\pi V^\pi(s)$$
Also:
$$V^*(s) = \max_a Q^*(s, a)$$
Can someone demonstrate to me step by step how we got from $V^*(s) = \max_\pi…

Ness
- 216
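A sketch of the step being asked about: since $V^{\pi}(s) = \sum_{a} \pi(a \mid s) Q^{\pi}(s, a)$ is a weighted average of the $Q^{\pi}(s, a)$, it is maximized by a policy that puts all of its probability on a best action, and the best achievable value of each term is $Q^*(s, a)$. Hence
$$V^*(s) = \max_{\pi} \sum_{a} \pi(a \mid s) Q^{\pi}(s, a) = \max_{a} Q^*(s, a).$$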
4
votes
1 answer
What do the terms 'Bellman backup' and 'Bellman error' mean?
Some RL literature uses terms such as 'Bellman backup' and 'Bellman error'. What do these terms refer to?

user529295
- 359
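In common usage: a Bellman backup is one application of a Bellman operator at a state, e.g. $(TV)(s) = \max_{a} \mathbb{E}\left[ r + \gamma V(s') \mid s, a \right]$, so called because it "backs up" value information from successor states. The Bellman error (or Bellman residual) is the discrepancy between a value function and its backup, $\delta(s) = (TV)(s) - V(s)$; its sampled, one-step form is the familiar TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$.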
4
votes
1 answer
How to prove the second form of Bellman's equation?
I'd like to prove this "second form" of Bellman's equation: $v(s) = \mathbb{E}[R_{t + 1} + \gamma v(S_{t+1}) \mid S_{t} = s]$ starting from Bellman's equation: $v(s) = \mathbb{E}[G_{t} \mid S_{t} = s]$ where the return $G_{t}$ is defined as follows:…

Daviiid
- 563
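The proof is short once the return's recursion $G_t = R_{t+1} + \gamma G_{t+1}$ is in hand:
\begin{align}
v(s) &= \mathbb{E}\left[ G_t \mid S_t = s \right] = \mathbb{E}\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\
&= \mathbb{E}\left[ R_{t+1} \mid S_t = s \right] + \gamma \, \mathbb{E}\left[ \mathbb{E}\left[ G_{t+1} \mid S_{t+1} \right] \mid S_t = s \right] \\
&= \mathbb{E}\left[ R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s \right],
\end{align}
using linearity of expectation, the tower property (law of total expectation), and the Markov property.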
4
votes
1 answer
How are afterstate value functions mathematically defined?
In this answer, afterstate value functions are mentioned, along with the fact that temporal-difference (TD) and Monte Carlo (MC) methods can also use these value functions. Mathematically, how are these value functions defined? Yes, they are a function of the next…

nbro
- 39,006
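One common way to formalize them (an assumption about the setting, since conventions differ): suppose each transition factors into a deterministic "move" $y = f(s, a)$ to an afterstate $y$, followed by a stochastic step to the next state. Then the afterstate value function is
$$v_{\text{after}}(y) = \mathbb{E}_{\pi}\left[ G_t \mid f(S_t, A_t) = y \right],$$
and whenever the reward depends only on the afterstate, $q_{\pi}(s, a) = v_{\text{after}}(f(s, a))$, so distinct $(s, a)$ pairs that lead to the same afterstate share a single value.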
4
votes
1 answer
Why doesn't value iteration use $\pi(a \mid s)$ while policy evaluation does?
I was looking at the Bellman equation, and I noticed a difference between the equations used in policy evaluation and value iteration.
In policy evaluation, the term $\pi(a \mid s)$ appears, which indicates the probability of choosing…

Chukwudi Ogbonna
- 125
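Placing the two updates side by side makes the difference visible:
\begin{align}
v_{k+1}(s) &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right] && \text{(policy evaluation)} \\
v_{k+1}(s) &= \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right] && \text{(value iteration)}
\end{align}
Policy evaluation averages over the actions a fixed policy $\pi$ would take, which is where $\pi(a \mid s)$ enters; value iteration instead commits to the best action via the $\max$, so no policy probabilities appear.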