Questions tagged [epsilon-greedy-policy]

For questions about the $\epsilon$-greedy policy, which is typically used as a behaviour policy (i.e. the policy an agent follows while interacting with the environment) in reinforcement learning.
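
As a rough illustration of what this tag covers, here is a minimal sketch of $\epsilon$-greedy action selection. The function name `epsilon_greedy_action` and the list-of-floats representation of the action values are assumptions made for this example, not something taken from a particular library or question.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    q_values: hypothetical list of value estimates, one per action.
    epsilon:  probability of taking a uniformly random action.
    """
    if rng.random() < epsilon:
        # Explore: uniform over ALL actions, so the greedy one can still be picked.
        return rng.randrange(len(q_values))
    # Exploit: highest estimated value; ties resolve to the lowest index.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` this always returns the greedy action, and with `epsilon=1.0` it behaves as a uniformly random policy.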

27 questions
7
votes
1 answer

What happens when you select actions using softmax instead of epsilon greedy in DQN?

I understand the two major branches of RL are Q-Learning and Policy Gradient methods. From my understanding (correct me if I'm wrong), policy gradient methods have exploration built in, as they select actions using a probability…
6
votes
1 answer

What is the probability of selecting the greedy action in a 0.5-greedy selection method for the 2-armed bandit problem?

I'm new to reinforcement learning and I'm going through Sutton and Barto. Exercise 2.1 states the following: In $\varepsilon$-greedy action selection, for the case of two actions and $\varepsilon=0.5$, what is the probability that the greedy action…
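
For reference, a short worked calculation under the usual reading of $\varepsilon$-greedy selection. It assumes a single greedy action and an exploratory step that draws uniformly from all actions, the greedy one included:

$$P(\text{greedy}) = (1-\varepsilon) + \frac{\varepsilon}{|\mathcal{A}|} = 0.5 + \frac{0.5}{2} = 0.75$$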
6
votes
1 answer

Is this proof of $\epsilon$-greedy policy improvement correct?

The following paragraph about $\epsilon$-greedy policies can be found at the end of page 100, under section 5.4, of the book "Reinforcement Learning: An Introduction" by Richard Sutton and Andrew Barto (second edition, 2018). but with probability…
5
votes
1 answer

Why does Q-learning converge under 100% exploration rate?

I am working on this assignment where I made the agent learn state-action values (Q-values) with Q-learning and 100% exploration rate. The environment is the classic gridworld as shown in the following picture. Here are the values of my…
5
votes
1 answer

Multi-armed bandits with a large number of arms

I'm dealing with a (stochastic) Multi Armed Bandit (MAB) with a large number of arms. Consider a pizza machine that produces a pizza depending on an input $i$ (equivalent to an arm). The (finite) set of arms $K$ is given by $K=X_1\times X_2 \times…
4
votes
1 answer

What does the term $|\mathcal{A}(s)|$ mean in the $\epsilon$-greedy policy?

I've been looking online for a while for a source that explains these computations, but I can't find anywhere what $|A(s)|$ means. I guess $A$ is the action set but I'm not sure about that notation: $$\frac{\varepsilon}{|\mathcal{A}(s)|}…
3
votes
2 answers

How is the probability of a greedy action in "$\epsilon$-greedy policies" derived?

In Sutton & Barto's book on reinforcement learning (section 5.4, p. 100) we have the following: The on-policy method we present in this section uses $\epsilon$ greedy policies, meaning that most of the time they choose an action that has maximal…
3
votes
2 answers

How to fight instability in self-play?

I'm working on a neural network that plays some board games like reversi or tic-tac-toe (zero-sum games, two players). I'm trying to have one network topology for all the games - I specifically don't want to set any limit for the number of available…
3
votes
1 answer

Can we stop training as soon as epsilon is small?

I'm new to reinforcement learning. As is common in RL, $\epsilon$-greedy search is used for the behavior/exploration. So, at the beginning of the training, $\epsilon$ is high, and therefore a lot of random actions are chosen. With time,…
3
votes
1 answer

Is there an advantage in decaying $\epsilon$ during Q-Learning?

If the agent is following an $\epsilon$-greedy policy derived from Q, is there any advantage to decaying $\epsilon$ even though $\epsilon$ decay is not required for convergence?
3
votes
1 answer

What is the difference between the $\epsilon$-greedy and softmax policies?

Could someone explain to me what the key difference is between the $\epsilon$-greedy policy and the softmax policy? In particular, in the context of the SARSA and Q-Learning algorithms. I understood the main difference between these two algorithms, but…
2
votes
1 answer

Do eligibility traces and epsilon-greedy do the same task in different ways?

I understand that, in Reinforcement Learning algorithms, such as Q-learning, to prevent selecting the actions with greatest q-values too fast and allow for exploration, we use eligibility traces. Here are some questions: Does $\epsilon$-greedy solve…
2
votes
1 answer

How to code an $\epsilon$-soft policy for on-policy Monte Carlo control?

I was trying to code the on-policy Monte Carlo control method. The initial policy chosen needs to be an $\epsilon$-soft policy. Can someone tell me how to code an $\epsilon$-soft policy? I know how to code the $\epsilon$-greedy. In $\epsilon$-soft,…
2
votes
1 answer

What should the value of epsilon be in Q-learning?

I am trying to understand Reinforcement Learning and have already explored different YouTube videos, blog posts, and Wikipedia articles. What I don't understand is the impact of $\epsilon$. What value should it take? $0.5$, $0.6$, or $0.7$? What does it…
1
vote
1 answer

Why is my DQN agent not converging to a constant reward?

I'm currently training a DQN agent. I use an epsilon greedy exploration strategy where I decay the epsilon value linearly until it reaches 0 over 300 episodes. For the remaining 50 episodes, epsilon is always 0. Since the value is 0, I…
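
The schedule described in this excerpt could be sketched roughly as follows. The function name `linear_epsilon` and the starting value of 1.0 are assumptions for illustration (the excerpt does not state the initial epsilon); only the 300-episode decay to 0 comes from the question.

```python
def linear_epsilon(episode, start=1.0, end=0.0, decay_episodes=300):
    """Linearly interpolate epsilon from `start` to `end`, then hold at `end`.

    `start=1.0` is an assumed initial value; the 300-episode horizon
    and final value of 0 follow the excerpt above.
    """
    if episode >= decay_episodes:
        return end  # decay finished: purely greedy when end == 0
    fraction = episode / decay_episodes
    return start + fraction * (end - start)
```

Under these assumptions, episodes 300 through 349 all run with an epsilon of 0, so the agent acts greedily with respect to its current Q-estimates.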