Questions tagged [multi-armed-bandits]
For questions related to the multi-armed bandit (MAB) problem, in which a fixed, limited set of resources must be allocated among competing (alternative) choices in a way that maximizes the expected gain, when each choice's properties are only partially known at the time of allocation.
57 questions
6 votes · 1 answer
What is a weighted average in a non-stationary k-armed bandit problem?
In the book Reinforcement Learning: An Introduction (page 25), by Richard S. Sutton and Andrew G. Barto, there is a discussion of the k-armed bandit problem, where the expected reward from the bandits changes slightly over time (that is, the problem…

chessprogrammer (2,215)
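A note for readers skimming this tag (not part of the question above): for nonstationary problems, Sutton and Barto replace the sample average with a constant step size $\alpha \in (0, 1]$, which yields an exponential recency-weighted average of past rewards,
$$Q_{n+1} = Q_n + \alpha\,(R_n - Q_n) = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i,$$
so recent rewards receive more weight than older ones.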
6 votes · 1 answer
What is the probability of selecting the greedy action in a 0.5-greedy selection method for the 2-armed bandit problem?
I'm new to reinforcement learning and I'm going through Sutton and Barto. Exercise 2.1 states the following:
In $\varepsilon$-greedy action selection, for the case of two actions and $\varepsilon=0.5$, what is the probability that the greedy action…

Daviiid (563)
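A brief note on the standard reasoning behind this exercise (summarized here, not quoted from the post): the greedy action is chosen either deliberately, with probability $1-\varepsilon$, or by chance when the exploratory draw happens to select it, with probability $\varepsilon/k$ for $k$ actions, so
$$P(\text{greedy}) = (1-\varepsilon) + \frac{\varepsilon}{k} = 0.5 + \frac{0.5}{2} = 0.75.$$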
6 votes · 1 answer
How do I recognise a bandit problem?
I'm having difficulty understanding the distinction between a bandit problem and a non-bandit problem.
An example of the bandit problem is an agent playing $n$ slot machines with the goal of discovering which slot machine is the most probable to…

blue-sky (325)
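For readers new to the setting, the slot-machine example above can be made concrete with a minimal simulation (a sketch with made-up payout probabilities, not code from the question); the defining feature of a bandit is that there are no states or transitions, only repeated arm pulls and rewards:

    import random

    true_probs = [0.2, 0.5, 0.7]      # unknown payout probability of each slot machine
    counts = [0] * len(true_probs)    # number of pulls per arm
    values = [0.0] * len(true_probs)  # empirical mean reward per arm
    epsilon = 0.1                     # exploration rate

    for _ in range(10_000):
        if random.random() < epsilon:
            arm = random.randrange(len(true_probs))                     # explore
        else:
            arm = max(range(len(true_probs)), key=lambda a: values[a])  # exploit
        reward = 1.0 if random.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]             # incremental mean

    print(values)  # estimates converge toward the true payout probabilities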
5 votes · 1 answer
What are the major differences between multi-armed bandits and other well-known algorithms (DQN, A3C, PPO, etc.)?
In the past, I have studied different algorithms, e.g. DQN, DDQN, REINFORCE, A3C, PPO, TRPO, and so on. I am doing an internship this summer where I have to use a multi-armed bandit (MAB). I am a bit confused between MAB and the other above…

notaprogrammertoday (53)
5 votes · 1 answer
Multi-armed bandits with a large number of arms
I'm dealing with a (stochastic) multi-armed bandit (MAB) with a large number of arms.
Consider a pizza machine that produces a pizza depending on an input $i$ (equivalent to an arm). The (finite) set of arms $K$ is given by $K=X_1\times X_2 \times…

D. B. (101)
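As a quick illustration of why such product-structured arm sets get large (toy dimensions, assumed here only for illustration): the number of arms is $|K| = \prod_j |X_j|$, so even a few small input dimensions multiply out quickly:

    from itertools import product

    # Hypothetical pizza-machine inputs (illustrative values only)
    X1 = ["thin", "thick"]            # crust
    X2 = ["low", "medium", "high"]    # oven temperature
    X3 = list(range(10))              # baking-time buckets

    arms = list(product(X1, X2, X3))  # the arm set K = X1 x X2 x X3
    print(len(arms))                  # 2 * 3 * 10 = 60 arms from three small dimensions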
5 votes · 1 answer
Can you convert an MDP problem to a contextual multi-armed bandit problem?
I'm trying to get a better understanding of multi-armed bandits, contextual multi-armed bandits, and Markov decision processes.
Basically, a multi-armed bandit is a special case of a contextual multi-armed bandit where there is no state (features/context). And…

peidaqi (151)
5 votes · 2 answers
Are bandits considered an RL approach?
If a research paper uses multi-armed bandits (either in their standard or contextual form) to solve a particular task, can we say that they solved this task using a reinforcement learning approach? Or should we distinguish between the two and use…

user5093249 (722)
5 votes · 1 answer
Is it possible to solve a problem with continuous action spaces and no states with reinforcement learning?
I want to use Reinforcement Learning to optimize the distribution of energy for a peak shaving problem given by a thermodynamical simulation. However, I am not sure how to proceed as the action space is the only thing that really matters, in this…

FS93 (145)
4 votes · 1 answer
Is the Bandit Problem an MDP?
I've read Sutton and Barto's introductory RL book. They define a policy as a mapping from states to probabilities of selecting each possible action. If the agent is following policy $\pi$ at time $t$, then $\pi(a|s)$ is the probability of taking…

Snowball (213)
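For reference, the standard Sutton and Barto definition the excerpt is pointing at (restated here, not quoted): $\pi(a \mid s) \doteq \Pr\{A_t = a \mid S_t = s\}$. A bandit can then be viewed as an MDP with a single state, so the policy reduces to a distribution over actions alone.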
4 votes · 1 answer
Why do we have two similar action selection strategies for UCB1?
In the literature, there are at least two action selection strategies associated with the UCB1 algorithm. For example, in the paper Algorithms for the multi-armed bandit problem (2000/2014), at time step $t$, an action is…

nbro (39,006)
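For context (standard forms from the literature, not quoted from the question): the original UCB1 rule of Auer, Cesa-Bianchi, and Fischer (2002) selects, at time step $t$,
$$a_t = \arg\max_{j} \left( \bar{x}_j + \sqrt{\frac{2 \ln t}{n_j}} \right),$$
where $\bar{x}_j$ is the empirical mean reward of arm $j$ and $n_j$ its pull count, while Sutton and Barto present the same idea with a tunable exploration constant $c$ in place of the fixed $\sqrt{2}$, i.e. $Q_t(a) + c\sqrt{\ln t / N_t(a)}$.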
4 votes · 2 answers
Why is regret so defined in MABs?
Consider a multi-armed bandit (MAB). There are $k$ arms, with reward distributions $R_i$ where $1 \leq i \leq k$. Let $\mu_i$ denote the mean of the $i^{th}$ distribution.
If we run the multi-armed bandit experiment for $T$ rounds, the "pseudo…

stoic-santiago (1,121)
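For readers skimming: the quantity the excerpt starts to define is usually written (standard definition, completed here rather than quoted) as the pseudo-regret
$$\bar{R}_T = T\,\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{I_t}\right], \qquad \mu^* = \max_{1 \le i \le k} \mu_i,$$
where $I_t$ is the arm pulled in round $t$.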
4 votes · 2 answers
Why do we use $X_{I_t,t}$ and $v_{I_t}$ to denote the reward received at time step $t$ and the distribution of the chosen arm $I_t$?
I'm doing some introductory research on classical (stochastic) MABs. However, I'm a little confused about the common notation (e.g. in the popular paper of Auer (2002) or Bubeck and Cesa-Bianchi (2012)).
As in the latter study, let us consider an…

MAB_N00B (41)
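A short gloss on the notation in the title (paraphrasing the convention used by Auer and by Bubeck and Cesa-Bianchi, not the question itself): each arm $i$ has a reward distribution $v_i$; in round $t$ the learner picks arm $I_t$ and observes $X_{I_t,t} \sim v_{I_t}$, where the double subscript separates the arm index from the time index.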
3 votes · 1 answer
How to implement a contextual reinforcement learning model?
In a reinforcement learning model, states depend on the previous actions chosen. In the case in which some of the states (but not all) are fully independent of the actions (but still obviously determine the optimal actions), how could we take these…

freesoul (246)
3 votes · 3 answers
Why aren't exploration techniques, such as UCB or Thompson sampling, used in full RL problems?
Why aren't exploration techniques such as UCB or Thompson sampling, which are typically used in bandit problems, used in full RL problems?
Monte Carlo Tree Search may use the above-mentioned methods in its selection step, but why do value-based and policy…

Mika (331)
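Since the question above mentions Thompson sampling, here is a minimal Bernoulli-bandit version for readers who haven't seen it (a sketch with assumed arm probabilities, not tied to the question): each arm keeps a Beta posterior over its success probability, and each round the arm with the largest posterior sample is pulled.

    import random

    true_probs = [0.2, 0.5, 0.7]       # assumed unknown success probabilities
    successes = [1] * len(true_probs)  # Beta(1, 1) priors
    failures = [1] * len(true_probs)

    for _ in range(10_000):
        # Sample a plausible success rate for each arm from its Beta posterior
        samples = [random.betavariate(successes[a], failures[a])
                   for a in range(len(true_probs))]
        arm = max(range(len(true_probs)), key=lambda a: samples[a])
        if random.random() < true_probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    pulls = [successes[a] + failures[a] - 2 for a in range(len(true_probs))]
    print(pulls)  # most pulls should concentrate on the best arm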
3 votes · 0 answers
Mapping given probabilities to empirical probabilities
Consider the following problem statement:
You are given $n$ actions. You can perform any of them. Each action yields a success with some probability. The challenge is to perform a given finite number of actions so as to maximize the number of successes.
Here, we can…

Rnj (221)