Questions tagged [sarsa]

For questions related to SARSA, an on-policy reinforcement learning algorithm whose name comes from the quintuple it updates on: (state, action, reward, next state, next action), i.e. (s, a, r, s', a').
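For context, SARSA's one-step update uses exactly that quintuple. A minimal tabular sketch (function and variable names here are illustrative, not from any particular library):

```python
# One-step on-policy SARSA backup using the quintuple (s, a, r, s', a'):
#   Q(s, a) <- Q(s, a) + alpha * [r + gamma * Q(s', a') - Q(s, a)]
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * Q[(s_next, a_next)]   # bootstrap on the action actually taken next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])  # move Q(s, a) toward the TD target
    return Q
```

Here `Q` is any mapping from (state, action) pairs to values, e.g. a plain dict initialized to 0.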

43 questions
11 votes, 1 answer

Are Q-learning and SARSA the same when action selection is greedy?

I'm currently studying reinforcement learning and I'm having difficulties with question 6.12 in Sutton and Barto's book. Suppose action selection is greedy. Is Q-learning then exactly the same algorithm as SARSA? Will they make exactly the same…
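One piece of the answer can be seen directly: when the next action a' is selected greedily, SARSA's bootstrap term Q(s', a') equals Q-learning's max over actions, so the two update targets coincide on that step. A small sketch with hypothetical values (whether the two algorithms then behave identically overall is the subtle part of the exercise):

```python
q_next = {"left": 0.2, "right": 0.7, "stay": 0.5}  # hypothetical Q(s', .)
a_greedy = max(q_next, key=q_next.get)             # greedy choice of a'

sarsa_term = q_next[a_greedy]           # SARSA bootstraps on the chosen a'
q_learning_term = max(q_next.values())  # Q-learning bootstraps on the max

assert sarsa_term == q_learning_term    # targets coincide under greedy selection
```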
10 votes, 1 answer

Can Q-learning be used in a POMDP?

Can Q-learning (and SARSA) be directly used in a Partially Observable Markov Decision Process (POMDP)? If not, why not? My intuition is that the policies learned will be terrible because of partial observability. Are there ways to transform these…
8 votes, 2 answers

How should I handle action selection in the terminal state when implementing SARSA?

I recently started learning about reinforcement learning. Currently, I am trying to implement the SARSA algorithm. However, I do not know how to deal with $Q(s', a')$, when $s'$ is the terminal state. First, there is no action to choose from in this…
Hai Nguyen
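A common convention (consistent with the definition of the return, which is zero from a terminal state) is to treat Q(s', a') as 0 whenever s' is terminal, so the final update's target reduces to just r. A minimal sketch with illustrative names:

```python
def sarsa_target(r, gamma, Q, s_next, a_next, terminal):
    # From a terminal state the return is 0 by definition, so no a' is needed
    # and nothing is bootstrapped.
    if terminal:
        return r
    return r + gamma * Q[(s_next, a_next)]
```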
6 votes, 1 answer

Is Expected SARSA an off-policy or on-policy algorithm?

I understand that SARSA is an on-policy algorithm, and Q-learning an off-policy one. Sutton and Barto's textbook describes Expected Sarsa as follows: In these cliff walking results Expected Sarsa was used on-policy, but in general it might use a…
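For reference, Expected SARSA replaces SARSA's sampled bootstrap term Q(s', a') with an expectation under the (target) policy at s', target = r + γ Σ_a π(a|s') Q(s', a) — which is why the behavior policy and the policy in the update need not be the same. A sketch with illustrative names:

```python
def expected_sarsa_target(r, gamma, pi_next, q_next):
    # pi_next[a] = pi(a|s'), q_next[a] = Q(s', a). Taking the expectation
    # removes the sampling variance from the random choice of a'.
    expected_q = sum(pi_next[a] * q_next[a] for a in q_next)
    return r + gamma * expected_q
```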
5 votes, 1 answer

Understanding the n-step off-policy SARSA update

In Sutton & Barto's book (2nd ed.), page 149, there is equation (7.11). I am having a hard time understanding this equation. I would have thought that we should be moving $Q$ towards $G$, where $G$ would be corrected by importance sampling, but only…
5 votes, 1 answer

Expected SARSA vs SARSA in "RL: An Introduction"

Sutton and Barto state in the 2018-version of "Reinforcement Learning: An Introduction" in the context of Expected SARSA (p. 133) the following sentences: Expected SARSA is more complex computationally than Sarsa but, in return, it eliminates the…
4 votes, 2 answers

Is the optimal policy the one with the highest cumulative reward (Q-Learning vs SARSA)?

I was looking at the following diagram. The reward obtained with SARSA is higher. However, the path that Q-learning chooses is eventually the optimal one, isn't it? Why is the SARSA reward higher if it is not choosing the best path? Shouldn't the…
Pulse9
4 votes, 1 answer

How should I generate datasets for a SARSA agent when the environment is not simple?

I am currently working on my master's thesis and going to apply Deep-SARSA as my DRL algorithm. The problem is that there are no datasets available, and I guess that I should generate them somehow. Dataset generation seems a common feature in this…
4 votes, 1 answer

When do SARSA and Q-Learning converge to optimal Q values?

Here's another interesting multiple-choice question that puzzles me a bit. In tabular MDPs, if using a decision policy that visits all states an infinite number of times, and in each state, randomly selects an action, then: Q-learning will…
stoic-santiago
3 votes, 1 answer

Can we also estimate $V_{\pi}$ with SARSA?

For SARSA, I know we can estimate the action value $Q(s,a)$, and the relationship between $V(s)$ and $Q(s,a)$ is $V_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a|s)Q_{\pi} (s,a)$. So my question is, can we simply estimate $V_{\pi}$ by applying the above…
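That relationship can indeed be evaluated directly once π and Q are known. A sketch (illustrative names), assuming the ε-greedy policy that SARSA typically follows, so π is itself derived from Q:

```python
def v_from_q(q_s, epsilon=0.1):
    # V_pi(s) = sum_a pi(a|s) * Q_pi(s, a), with pi the epsilon-greedy
    # policy induced by Q: probability epsilon/n for each action, plus
    # (1 - epsilon) extra mass on the greedy action.
    n = len(q_s)
    greedy = max(q_s, key=q_s.get)
    v = 0.0
    for a, q in q_s.items():
        pi_a = (1 - epsilon) + epsilon / n if a == greedy else epsilon / n
        v += pi_a * q
    return v
```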
3 votes, 1 answer

When does backward propagation occur in n-step SARSA?

I am trying to understand the algorithm for n-step SARSA from Sutton and Barto (2nd Edition). As I understand it, this algorithm should update n state-action values, but I cannot see where it is propagated backward. Can someone explain to me how…
nehalem
3 votes, 1 answer

How to determine if Q-learning has converged in practice?

I am using Q-learning and SARSA to solve a problem. The agent learns to go from the start to the goal without falling in the holes. At each state, I can choose the action corresponding to the maximum Q value at the state (the greedy action that the…
3 votes, 1 answer

Can the agent wait until the end of the episode to determine the reward in SARSA?

From Sutton and Barto's book Reinforcement Learning (Adaptive Computation and Machine Learning series) (p. 99), the following definition of first-visit MC prediction, for estimating $V \approx V_\pi$, is given: Is determining the reward for each…
blue-sky
3 votes, 0 answers

Evaluating a policy learned using Q-learning

I have been reading literature on reinforcement learning in healthcare. I am slightly confused about policy evaluation for SARSA and Q-learning. To my knowledge, I believe that SARSA is used for policy evaluation, to find the Q values of…
calveeen
3 votes, 1 answer

What is the difference between the $\epsilon$-greedy and softmax policies?

Could someone explain to me the key difference between the $\epsilon$-greedy policy and the softmax policy? In particular, in the context of the SARSA and Q-Learning algorithms. I understood the main difference between these two algorithms, but…