Questions tagged [multi-armed-bandits]
For questions related to the multi-armed bandit (MAB) problem, in which a fixed, limited set of resources must be allocated among competing (alternative) choices in a way that maximizes the expected gain, when each choice's properties are only partially known at the time of allocation.
57 questions
6 votes · 1 answer
What is a weighted average in a non-stationary k-armed bandit problem?
In the book Reinforcement Learning: An Introduction (page 25), by Richard S. Sutton and Andrew G. Barto, there is a discussion of the k-armed bandit problem, where the expected reward from the bandits changes slightly over time (that is, the problem…

chessprogrammer (2,215)
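A note for readers skimming this tag (not part of the question above): for nonstationary problems, Sutton and Barto replace the sample average with a constant step size $\alpha \in (0, 1]$, which yields an exponential recency-weighted average of past rewards,
$$Q_{n+1} = Q_n + \alpha\,(R_n - Q_n) = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i,$$
so recent rewards receive more weight than older ones.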
6 votes · 1 answer
What is the probability of selecting the greedy action in a 0.5-greedy selection method for the 2-armed bandit problem?
I'm new to reinforcement learning and I'm going through Sutton and Barto. Exercise 2.1 states the following:
In $\varepsilon$-greedy action selection, for the case of two actions and $\varepsilon=0.5$, what is the probability that the greedy action…

Daviiid (563)
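A brief note on the standard reasoning behind this exercise (summarized here, not quoted from the post): the greedy action is chosen either deliberately, with probability $1-\varepsilon$, or by chance when the exploratory draw happens to select it, with probability $\varepsilon/k$ for $k$ actions, so
$$P(\text{greedy}) = (1-\varepsilon) + \frac{\varepsilon}{k} = 0.5 + \frac{0.5}{2} = 0.75.$$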
6 votes · 1 answer
How do I recognise a bandit problem?
I'm having difficulty understanding the distinction between a bandit problem and a non-bandit problem.
An example of the bandit problem is an agent playing $n$ slot machines with the goal of discovering which slot machine is the most probable to…

blue-sky (325)
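For readers new to the setting, the slot-machine example above can be made concrete with a minimal simulation (a sketch with made-up payout probabilities, not code from the question); the defining feature of a bandit is that there are no states or transitions, only repeated arm pulls and rewards:

    import random

    true_probs = [0.2, 0.5, 0.7]      # unknown payout probability of each slot machine
    counts = [0] * len(true_probs)    # number of pulls per arm
    values = [0.0] * len(true_probs)  # empirical mean reward per arm
    epsilon = 0.1                     # exploration rate

    for _ in range(10_000):
        if random.random() < epsilon:
            arm = random.randrange(len(true_probs))                     # explore
        else:
            arm = max(range(len(true_probs)), key=lambda a: values[a])  # exploit
        reward = 1.0 if random.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]             # incremental mean

    print(values)  # estimates converge toward the true payout probabilities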
5 votes · 1 answer
What are the major differences between multi-armed bandits and other well-known algorithms (DQN, A3C, PPO, etc.)?
In the past, I have studied different algorithms, e.g. DQN, DDQN, REINFORCE, A3C, PPO, TRPO, and so on. I am doing an internship this summer where I have to use a multi-armed bandit (MAB). I am a bit confused between MAB and the other above…

notaprogrammertoday (53)
5 votes · 1 answer
Multi-armed bandits with a large number of arms
I'm dealing with a (stochastic) multi-armed bandit (MAB) with a large number of arms.
Consider a pizza machine that produces a pizza depending on an input $i$ (equivalent to an arm). The (finite) set of arms $K$ is given by $K=X_1\times X_2 \times…

D. B. (101)
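As a quick illustration of why such product-structured arm sets get large (toy dimensions, assumed here only for illustration): the number of arms is $|K| = \prod_j |X_j|$, so even a few small input dimensions multiply out quickly:

    from itertools import product

    # Hypothetical pizza-machine inputs (illustrative values only)
    X1 = ["thin", "thick"]            # crust
    X2 = ["low", "medium", "high"]    # oven temperature
    X3 = list(range(10))              # baking-time buckets

    arms = list(product(X1, X2, X3))  # the arm set K = X1 x X2 x X3
    print(len(arms))                  # 2 * 3 * 10 = 60 arms from three small dimensions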
5 votes · 1 answer
Can you convert an MDP problem to a contextual multi-armed bandit problem?
I'm trying to get a better understanding of multi-armed bandits, contextual multi-armed bandits, and Markov decision processes.
Basically, a multi-armed bandit is a special case of a contextual multi-armed bandit where there is no state (features/context). And…

peidaqi (151)
5 votes · 2 answers
Are bandits considered an RL approach?
If a research paper uses multi-armed bandits (either in their standard or contextual form) to solve a particular task, can we say that they solved this task using a reinforcement learning approach? Or should we distinguish between the two and use…

user5093249 (722)
5 votes · 1 answer
Is it possible to solve a problem with continuous action spaces and no states with reinforcement learning?
I want to use Reinforcement Learning to optimize the distribution of energy for a peak shaving problem given by a thermodynamical simulation. However, I am not sure how to proceed as the action space is the only thing that really matters, in this…

FS93 (145)
4 votes · 1 answer
Is the Bandit Problem an MDP?
I've read Sutton and Barto's introductory RL book. They define a policy as a mapping from states to probabilities of selecting each possible action. If the agent is following policy $\pi$ at time $t$, then $\pi(a|s)$ is the probability of taking…

Snowball (213)
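For reference, the standard Sutton and Barto definition the excerpt is pointing at (restated here, not quoted): $\pi(a \mid s) \doteq \Pr\{A_t = a \mid S_t = s\}$. A bandit can then be viewed as an MDP with a single state, so the policy reduces to a distribution over actions alone.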
4 votes · 1 answer
Why do we have two similar action selection strategies for UCB1?
In the literature, there are at least two action selection strategies associated with the UCB1 algorithm. For example, in the paper Algorithms for the multi-armed bandit problem (2000/2014), at time step $t$, an action is…

nbro (39,006)
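For context (standard forms from the literature, not quoted from the question): the original UCB1 rule of Auer, Cesa-Bianchi, and Fischer (2002) selects, at time step $t$,
$$a_t = \arg\max_{j} \left( \bar{x}_j + \sqrt{\frac{2 \ln t}{n_j}} \right),$$
where $\bar{x}_j$ is the empirical mean reward of arm $j$ and $n_j$ its pull count, while Sutton and Barto present the same idea with a tunable exploration constant $c$ in place of the fixed $\sqrt{2}$, i.e. $Q_t(a) + c\sqrt{\ln t / N_t(a)}$.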
4 votes · 2 answers
Why is regret so defined in MABs?
Consider a multi-armed bandit (MAB). There are $k$ arms, with reward distributions $R_i$ where $1 \leq i \leq k$. Let $\mu_i$ denote the mean of the $i^{th}$ distribution.
If we run the multi-armed bandit experiment for $T$ rounds, the "pseudo…

stoic-santiago (1,121)
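For readers skimming: the quantity the excerpt starts to define is usually written (standard definition, completed here rather than quoted) as the pseudo-regret
$$\bar{R}_T = T\,\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{I_t}\right], \qquad \mu^* = \max_{1 \le i \le k} \mu_i,$$
where $I_t$ is the arm pulled in round $t$.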
4 votes · 2 answers
Why do we use $X_{I_t,t}$ and $v_{I_t}$ to denote the reward received at time step $t$ and the distribution of the chosen arm $I_t$?
I'm doing some introductory research on classical (stochastic) MABs. However, I'm a little confused about the common notation (e.g. in the popular paper of Auer (2002) or Bubeck and Cesa-Bianchi (2012)).
As in the latter study, let us consider an…

MAB_N00B (41)
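A short gloss on the notation in the title (paraphrasing the convention used by Auer and by Bubeck and Cesa-Bianchi, not the question itself): each arm $i$ has a reward distribution $v_i$; in round $t$ the learner picks arm $I_t$ and observes $X_{I_t,t} \sim v_{I_t}$, where the double subscript separates the arm index from the time index.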
3 votes · 1 answer
How to implement a contextual reinforcement learning model?
In a reinforcement learning model, states depend on the previous actions chosen. In the case in which some of the states (but not all) are fully independent of the actions (but still obviously determine the optimal actions), how could we take these…

freesoul (246)
3 votes · 3 answers
Why aren't exploration techniques, such as UCB or Thompson sampling, used in full RL problems?
Why aren't exploration techniques such as UCB or Thompson sampling, which are typically used in bandit problems, used in full RL problems?
Monte Carlo Tree Search may use the above-mentioned methods in its selection step, but why do value-based and policy…

Mika (331)
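Since the question above mentions Thompson sampling, here is a minimal Bernoulli-bandit version for readers who haven't seen it (a sketch with assumed arm probabilities, not tied to the question): each arm keeps a Beta posterior over its success probability, and each round the arm with the largest posterior sample is pulled.

    import random

    true_probs = [0.2, 0.5, 0.7]       # assumed unknown success probabilities
    successes = [1] * len(true_probs)  # Beta(1, 1) priors
    failures = [1] * len(true_probs)

    for _ in range(10_000):
        # Sample a plausible success rate for each arm from its Beta posterior
        samples = [random.betavariate(successes[a], failures[a])
                   for a in range(len(true_probs))]
        arm = max(range(len(true_probs)), key=lambda a: samples[a])
        if random.random() < true_probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    pulls = [successes[a] + failures[a] - 2 for a in range(len(true_probs))]
    print(pulls)  # most pulls should concentrate on the best arm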
3 votes · 0 answers
Mapping given probabilities to empirical probabilities
Consider the following problem statement:
You are given $n$ actions. You can perform any of them. Each action yields a success with some probability. The challenge is to perform a given finite number of actions so as to maximize the number of successes.
Here, we can…

Rnj (221)