For questions about the $\epsilon$-greedy policy, which is typically used as a behavioural policy (i.e. the policy a reinforcement learning agent follows while interacting with the environment).
Questions tagged [epsilon-greedy-policy]
27 questions
7
votes
1 answer
What happens when you select actions using softmax instead of epsilon greedy in DQN?
I understand the two major branches of RL are Q-Learning and Policy Gradient methods.
From my understanding (correct me if I'm wrong), policy gradient methods have exploration built in, as they select actions using a probability…

Linsu Han
- 73
- 4
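A minimal sketch of the two selection rules contrasted in the question above, assuming the Q-values for one state are given as a NumPy array (the function names, epsilon, and temperature values are illustrative, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy (argmax) action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_action(q_values, temperature=1.0):
    """Sample an action from a Boltzmann distribution over Q-values;
    lower temperature pushes the distribution closer to greedy."""
    prefs = q_values / temperature
    prefs = prefs - prefs.max()                 # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))

q = np.array([1.0, 2.0, 0.5])
a = epsilon_greedy(q)       # usually the argmax (index 1)
b = softmax_action(q, 0.5)  # biased toward index 1, but all actions possible
```

The practical difference: ε-greedy explores uniformly among non-greedy actions, while softmax explores in proportion to the Q-values, so clearly bad actions are tried less often.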
6
votes
1 answer
What is the probability of selecting the greedy action in a 0.5-greedy selection method for the 2-armed bandit problem?
I'm new to reinforcement learning and I'm going through Sutton and Barto. Exercise 2.1 states the following:
In $\varepsilon$-greedy action selection, for the case of two actions and $\varepsilon=0.5$, what is the probability that the greedy action…

Daviiid
- 563
- 3
- 15
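Under ε-greedy selection, the greedy action is taken either when the agent exploits, or when the uniform exploration draw happens to land on the greedy action; for two actions and $\varepsilon = 0.5$ this standard decomposition gives:

```latex
P(\text{greedy}) \;=\; \underbrace{(1-\varepsilon)}_{\text{exploit}} \;+\; \underbrace{\frac{\varepsilon}{|\mathcal{A}|}}_{\text{explore, hit greedy}}
\;=\; 0.5 + \frac{0.5}{2} \;=\; 0.75 .
```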
6
votes
1 answer
Is this proof of $\epsilon$-greedy policy improvement correct?
The following paragraph about $\epsilon$-greedy policies can be found at the end of page 100, under section 5.4, of the book "Reinforcement Learning: An Introduction" by Richard Sutton and Andrew Barto (second edition, 2018).
but with probability…

Nishanth Rao
- 147
- 6
5
votes
1 answer
Why does Q-learning converge under 100% exploration rate?
I am working on this assignment where I made the agent learn state-action values (Q-values) with Q-learning and 100% exploration rate. The environment is the classic gridworld as shown in the following picture.
Here are the values of my…

Rim Sleimi
- 215
- 1
- 6
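The convergence the question observes follows from Q-learning being off-policy: its target uses the greedy action regardless of how the behaviour action was chosen, so even a 100% exploration rate still propagates greedy values. The tabular update (in the book's usual notation) is:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \Big]
```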
5
votes
1 answer
Multi-Armed Bandits with a large number of arms
I'm dealing with a (stochastic) Multi Armed Bandit (MAB) with a large number of arms.
Consider a pizza machine that produces a pizza depending on an input $i$ (equivalent to an arm). The (finite) set of arms $K$ is given by $K=X_1\times X_2 \times…

D. B.
- 101
- 6
4
votes
1 answer
What does the term $|\mathcal{A}(s)|$ mean in the $\epsilon$-greedy policy?
I've been looking online for a while for a source that explains these computations, but I can't find anywhere what $|A(s)|$ means. I guess $A$ is the action set, but I'm not sure about that notation:
$$\frac{\varepsilon}{|\mathcal{A}(s)|}…

Metrician
- 95
- 5
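$|\mathcal{A}(s)|$ does indeed denote the cardinality of the action set, i.e. the number of actions available in state $s$; the full ε-greedy policy in which the term appears is usually written:

```latex
\pi(a \mid s) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}(s)|} & \text{if } a = \arg\max_{a'} Q(s, a'), \\[6pt]
\dfrac{\varepsilon}{|\mathcal{A}(s)|} & \text{otherwise.}
\end{cases}
```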
3
votes
2 answers
How is the probability of a greedy action in "$\epsilon$-greedy policies" derived?
In Sutton & Barto's book on reinforcement learning (section 5.4, p. 100) we have the following:
The on-policy method we present in this section uses $\epsilon$-greedy policies, meaning that most of the time they choose an action that has maximal…

user3489173
- 179
- 6
3
votes
2 answers
How to fight instability in self-play?
I'm working on a neural network that plays some board games like reversi or tic-tac-toe (zero-sum games, two players). I'm trying to have one network topology for all the games - I specifically don't want to set any limit for the number of available…

Maras
- 141
- 5
3
votes
1 answer
Can we stop training as soon as epsilon is small?
I'm new to reinforcement learning.
As is common in RL, $\epsilon$-greedy search is used for the behavior/exploration. So, at the beginning of training, $\epsilon$ is high, and therefore a lot of random actions are chosen. With time,…

Micha Christ
- 31
- 1
3
votes
1 answer
Is there an advantage in decaying $\epsilon$ during Q-Learning?
If the agent is following an $\epsilon$-greedy policy derived from Q, is there any advantage to decaying $\epsilon$ even though $\epsilon$ decay is not required for convergence?

KaneM
- 309
- 2
- 13
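A sketch of two common ε-decay schedules for a setting like the one above; all parameter values are illustrative, not recommendations:

```python
def linear_decay(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Anneal epsilon linearly from eps_start to eps_end, then hold."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def exponential_decay(step, eps_start=1.0, eps_end=0.05, rate=0.999):
    """Multiply epsilon by `rate` each step, never going below eps_end."""
    return max(eps_end, eps_start * rate ** step)
```

The practical advantage of decaying: with a fixed ε the agent keeps taking random actions forever, which caps the return it actually collects even after the Q-values have converged; decaying trades early exploration for late exploitation.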
3
votes
1 answer
What is the difference between the $\epsilon$-greedy and softmax policies?
Could someone explain to me the key difference between the $\epsilon$-greedy policy and the softmax policy? In particular, in the context of the SARSA and Q-Learning algorithms. I understood the main difference between these two algorithms, but…

FraMan
- 189
- 2
- 10
2
votes
1 answer
Do eligibility traces and epsilon-greedy do the same task in different ways?
I understand that, in reinforcement learning algorithms such as Q-learning, to prevent selecting the actions with the greatest Q-values too quickly and to allow for exploration, we use eligibility traces.
Here are some questions
Does $\epsilon$-greedy solve…

Abhishek Dhyani
- 31
- 3
2
votes
1 answer
How to code an $\epsilon$-soft policy for on-policy Monte Carlo control?
I was trying to code the on-policy Monte Carlo control method. The initial policy chosen needs to be an $\epsilon$-soft policy.
Can someone tell me how to code an $\epsilon$-soft policy?
I know how to code the $\epsilon$-greedy. In $\epsilon$-soft,…

A Q
- 23
- 4
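One standard answer to the question above: ε-greedy is itself an ε-soft policy, since every action keeps probability at least $\epsilon/|\mathcal{A}(s)|$, so an ε-greedy implementation that exposes its action probabilities already satisfies the ε-soft requirement of on-policy Monte Carlo control. A minimal sketch, with all names illustrative:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_soft_probs(q_values, epsilon=0.1):
    """Epsilon-greedy action probabilities: every action gets
    epsilon / |A|, and the greedy action additionally gets 1 - epsilon.
    This satisfies the epsilon-soft condition pi(a|s) >= epsilon / |A|."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs

def sample_action(q_values, epsilon=0.1):
    """Draw one action from the epsilon-soft distribution."""
    probs = epsilon_soft_probs(q_values, epsilon)
    return int(rng.choice(len(probs), p=probs))
```

Returning the full probability vector (rather than just sampling) is useful in Monte Carlo control, where the policy's probabilities may be needed for the update itself.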
2
votes
1 answer
What should the value of epsilon be in Q-learning?
I am trying to understand Reinforcement Learning and have already explored various YouTube videos, blog posts, and Wikipedia articles.
What I don't understand is the impact of $\epsilon$. What value should it take? $0.5$, $0.6$, or $0.7$?
What does it…

Exploring
- 223
- 6
- 16
1
vote
1 answer
Why is my DQN agent not converging to a constant reward?
I'm currently training a DQN agent. I use an epsilon-greedy exploration strategy where I decay the epsilon value linearly until it reaches 0 over 300 episodes. For the remaining 50 episodes, epsilon is always 0. Since the value is 0, I…

gondorian
- 35
- 6