Questions tagged [sutton-barto]

For questions related to the book "Reinforcement Learning: An Introduction" (by Richard S. Sutton and Andrew G. Barto).

"Reinforcement Learning: An Introduction" (by Andrew Barto and Richard S. Sutton) is often considered or cited as the most comprehensive introductory manual to the field of RL, by two of the greatest contributors to the field.

Two editions have been published so far: the first in 1998 and the second in 2018. Material related to the book (including some drafts) is available at http://incompleteideas.net/book/.

88 questions
22 votes, 2 answers

What is the difference between reinforcement learning and optimal control?

Coming from a process (optimal) control background, I have begun studying the field of deep reinforcement learning. Sutton & Barto (2015) state that particularly important (to the writing of the text) have been the contributions establishing and…
17 votes, 4 answers

Why does the discount rate in the REINFORCE algorithm appear twice?

I was reading the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (complete draft, November 5, 2017). On page 271, the pseudo-code for the episodic Monte-Carlo Policy-Gradient Method is presented. Looking at…
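For context, the per-time-step update in the book's episodic REINFORCE pseudocode (as I recall it from the 2nd-edition draft) has the form below; the discount factor enters both inside the return $G_t$ and as the $\gamma^t$ factor in front of the gradient, which is exactly the "appears twice" the question asks about:

$$\theta \leftarrow \theta + \alpha\, \gamma^{t}\, G_t\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta), \qquad G_t = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k.$$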
12 votes, 4 answers

Counterexamples to the reward hypothesis

In Sutton and Barto's RL book, the reward hypothesis is stated as follows: that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called…
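As a rough sketch of the hypothesis in symbols (my paraphrase, not a quote from the book): the agent's goal is assumed to be expressible as maximizing the expected return

$$\mathbb{E}\!\left[G_t\right], \qquad G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},$$

where $R_t$ is the scalar reward signal and $\gamma$ the discount factor.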
11 votes, 2 answers

How do we prove the n-step return error reduction property?

In section 7.1 (about n-step bootstrapping) of the book Reinforcement Learning: An Introduction (2nd edition), by Richard S. Sutton and Andrew G. Barto, the authors write about what they call the "n-step return error reduction property": But they…
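For reference, the property in question (equation 7.3 in the 2nd edition, as best I recall) bounds the worst-case error of the expected n-step return by $\gamma^n$ times the worst-case error of the current value estimate:

$$\max_s \Big| \mathbb{E}_\pi\!\left[G_{t:t+n} \mid S_t = s\right] - v_\pi(s) \Big| \;\le\; \gamma^{n} \max_s \Big| V_{t+n-1}(s) - v_\pi(s) \Big|.$$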
8 votes, 1 answer

How is the policy gradient calculated in REINFORCE?

Reading Sutton and Barto, I see the following in describing policy gradients: How is the gradient calculated with respect to an action (taken at time t)? I've read implementations of the algorithm, but conceptually I'm not sure I understand how the…
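A minimal sketch of how the per-step gradient can be computed for one common concrete case, a softmax policy over linear action preferences (the choice of policy class and the numerical values below are assumptions for illustration; the eligibility-vector formula $\nabla_\theta \ln \pi(a \mid s, \theta) = x(s,a) - \sum_b \pi(b \mid s, \theta)\, x(s,b)$ is the one I recall the book giving for this case):

    import numpy as np

    def softmax_policy(theta, features):
        # features: array of shape (num_actions, d); row a is the feature vector x(s, a)
        prefs = features @ theta                      # action preferences h(s, a, theta)
        prefs = prefs - prefs.max()                   # shift for numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()   # pi(a | s, theta)
        return probs

    def log_policy_gradient(theta, features, action):
        # Eligibility vector for a linear softmax policy:
        # grad_theta ln pi(action | s, theta) = x(s, action) - sum_b pi(b | s, theta) x(s, b)
        probs = softmax_policy(theta, features)
        return features[action] - probs @ features

    # Hypothetical usage: 3 actions, 4-dimensional features, action taken at time t is 1
    theta = np.zeros(4)
    features = np.random.randn(3, 4)
    grad_ln_pi = log_policy_gradient(theta, features, action=1)

In REINFORCE this per-step gradient is then scaled by the return $G_t$ observed from time $t$, so the "gradient with respect to an action" is really the gradient of the log-probability of the action that was actually taken.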
7 votes, 2 answers

In Value Iteration, why can we initialize the value function arbitrarily?

I have not been able to find a good explanation of this, other than statements that the algorithm is guaranteed to converge with arbitrary choices for initial values in each state. Is this something to do with the Bellman optimality constraint…
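The usual argument (a sketch, not a quote from the book) is that the Bellman optimality backup is a $\gamma$-contraction in the max norm, so repeated application converges to its unique fixed point $v_*$ from any starting values:

$$\max_s \big| (T V)(s) - (T V')(s) \big| \;\le\; \gamma \max_s \big| V(s) - V'(s) \big|, \qquad (T V)(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma V(s')\big].$$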
6 votes, 2 answers

How can the importance sampling ratio be different than zero when the target policy is deterministic?

In the book Reinforcement Learning: An Introduction (2nd edition), Sutton and Barto define on page 104 (p. 126 of the PDF), in equation (5.3), the importance sampling ratio, $\rho _{t:T-1}$, as follows: $$\rho…
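For readers without the book at hand, the definition being referenced is (as I recall equation 5.3):

$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},$$

so when the target policy $\pi$ is deterministic, each factor $\pi(A_k \mid S_k)$ is either 1 (if $A_k$ is the action $\pi$ would take in $S_k$) or 0, and the ratio is non-zero only on trajectories where the behaviour policy happened to take exactly the target policy's actions.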
6 votes, 5 answers

How do you compute the table for $p(s',r|s,a)$ (exercise 3.5 in Sutton & Barto's book)?

I am trying to study the book Reinforcement Learning: An Introduction (Sutton & Barto, 2018). In section 3.1 the authors state the following exercise: Exercise 3.5 Give a table analogous to that in Example 3.3, but for $p(s',r|s,a)$. It should have…
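A minimal sketch of how one might tabulate $p(s', r \mid s, a)$ in code, in the style of the recycling-robot Example 3.3 (the symbols alpha, beta, r_search, r_wait follow the book's notation, but the specific entries and the numerical values are my reconstruction from memory and should be checked against the book's table):

    from collections import defaultdict

    # Example numerical values chosen arbitrarily here; the book leaves alpha and beta symbolic.
    alpha, beta = 0.9, 0.6
    r_search, r_wait = 2.0, 1.0

    # Keys are (s, a, s_next, reward); values are probabilities p(s', r | s, a).
    p = {
        ("high", "search",   "high", r_search): alpha,
        ("high", "search",   "low",  r_search): 1 - alpha,
        ("low",  "search",   "low",  r_search): beta,
        ("low",  "search",   "high", -3.0):     1 - beta,   # battery depleted, robot rescued
        ("high", "wait",     "high", r_wait):   1.0,
        ("low",  "wait",     "low",  r_wait):   1.0,
        ("low",  "recharge", "high", 0.0):      1.0,
    }

    # Sanity check: probabilities for each (s, a) pair must sum to 1.
    totals = defaultdict(float)
    for (s, a, s_next, r), prob in p.items():
        totals[(s, a)] += prob
    assert all(abs(v - 1.0) < 1e-9 for v in totals.values())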
6 votes, 1 answer

If $\gamma \in (0,1)$, what is the on-policy state distribution for episodic tasks?

In Reinforcement Learning: An Introduction, section 9.2 (page 199), Sutton and Barto describe the on-policy distribution in episodic tasks, with $\gamma =1$, as being \begin{equation} \mu(s) = \frac{\eta(s)}{\sum_{k \in S}…
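For reference, the undiscounted definition the question starts from is (as I recall section 9.2):

$$\mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')}, \qquad \eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s})\, p(s \mid \bar{s}, a),$$

where $h(s)$ is the probability that an episode starts in $s$. The question asks how $\eta$ should read when $\gamma < 1$; if I recall correctly, the book's footnote there suggests treating discounting as a form of termination, which would place a factor of $\gamma$ in front of the second term.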
5 votes, 1 answer

If the current state is $S_t$ and the actions are chosen according to $\pi$, what is the expectation of $R_{t+1}$ in terms of $\pi$ and $p$?

I'm trying to solve exercise 3.11 from Sutton and Barto's book (2nd edition): Exercise 3.11 If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms…
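A standard way to write this expectation (my own expansion using the four-argument dynamics function $p$, not quoted from any solutions manual):

$$\mathbb{E}_\pi\!\left[R_{t+1} \mid S_t = s\right] = \sum_a \pi(a \mid s) \sum_{s'} \sum_r r \; p(s', r \mid s, a).$$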
5 votes, 1 answer

Should the policy parameters be updated at each time step or at the end of the episode in REINFORCE?

REINFORCE is a Monte Carlo policy gradient algorithm, which updates the weights (parameters) of the policy network by generating episodes. Here's the pseudo-code from Sutton's book (which is the same as the equation in Silver's RL notes): When I try to implement…
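A minimal sketch of the loop structure as I read the book's pseudocode: the whole episode is generated with the current parameters first, and only then is one gradient step taken for each time step $t$ of that episode (env, policy and grad_log_pi below are hypothetical placeholders, not names from the book):

    def reinforce_episode(env, policy, grad_log_pi, theta, alpha=1e-3, gamma=0.99):
        # 1) Generate one full episode with the current parameters theta.
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = policy(theta, s)               # sample A_t ~ pi(. | S_t, theta)
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next

        # 2) Only after the episode ends, update theta once per time step t.
        T = len(rewards)
        for t in range(T):
            G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))   # return from t
            theta = theta + alpha * (gamma ** t) * G * grad_log_pi(theta, states[t], actions[t])
        return theta

So the updates are per time step, but they all happen after the episode has been collected, which is what makes it a Monte Carlo method.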
5 votes, 1 answer

Understanding the n-step off-policy SARSA update

In Sutton & Barto's book (2nd ed) page 149, there is the equation 7.11 I am having a hard time understanding this equation. I would have thought that we should be moving $Q$ towards $G$, where $G$ would be corrected by importance sampling, but only…
5 votes, 1 answer

Expected SARSA vs SARSA in "RL: An Introduction"

Sutton and Barto state in the 2018 version of "Reinforcement Learning: An Introduction", in the context of Expected SARSA (p. 133), the following sentences: Expected SARSA is more complex computationally than Sarsa but, in return, it eliminates the…
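For context, the Expected SARSA update being discussed replaces Sarsa's sampled next action value with an expectation under the policy (as I recall equation 6.9):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Big],$$

which is more expensive per step (the sum over actions) but removes the variance due to sampling $A_{t+1}$.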
4 votes, 2 answers

Why is $\sum_{s} \eta(s)$ a constant of proportionality in the proof of the policy gradient theorem?

In Sutton and Barto's book (http://incompleteideas.net/book/bookdraft2017nov5.pdf), a proof of the policy gradient theorem is provided on pg. 269 for an episodic case and a start state policy objective function (see picture below, last 3…
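For reference, the theorem in question states, in the episodic case as I recall it,

$$\nabla J(\theta) \;\propto\; \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta),$$

and the book remarks that the constant of proportionality in the episodic case is the average length of an episode, i.e. $\sum_s \eta(s)$, which is what this question is probing.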
4 votes, 1 answer

$E_{\pi}[R_{t+1}|S_t=s,A_t=a] = E[R_{t+1}|S_t=s,A_t=a]$?

I would like to solve the first question of Exercise 3.19 from Sutton and Barto: Exercise 3.19 The value of an action, $q_{\pi}(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in…
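The identity the exercise builds towards is the standard one (my own statement, not the book's official solution):

$$q_\pi(s, a) = \mathbb{E}\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right] = \sum_{s', r} p(s', r \mid s, a)\, \big[ r + \gamma\, v_\pi(s') \big];$$

the subscript $\pi$ on the first expectation can be dropped because, once both $S_t = s$ and $A_t = a$ are given, the distribution of the next reward no longer depends on the policy.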