For questions about the contextual bandit (CB) problem and algorithms that solve it. The CB problem is a generalization of the (context-free) multi-armed bandit problem: there is more than one situation (or state), and the optimal action in one state may differ from the optimal action in another, but, unlike in the full reinforcement learning problem, the actions affect only the rewards, not the states.
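A minimal sketch of that interaction loop, assuming epsilon-greedy exploration with one linear reward estimate per arm (all dimensions, names, and the simulated environment below are illustrative, not taken from any specific question):

```python
import numpy as np

# Minimal contextual-bandit loop: the context is drawn fresh each round
# (it is NOT affected by the chosen action), and only the reward depends
# on the (context, action) pair. Epsilon-greedy over per-arm linear models.
rng = np.random.default_rng(0)
n_arms, n_features, epsilon = 3, 4, 0.1

A = [np.eye(n_features) for _ in range(n_arms)]      # running X^T X + I per arm
b = [np.zeros(n_features) for _ in range(n_arms)]    # running X^T r per arm

true_theta = rng.normal(size=(n_arms, n_features))   # hidden reward parameters

for t in range(1000):
    x = rng.normal(size=n_features)                  # context for this round
    estimates = [np.linalg.solve(A[a], b[a]) @ x for a in range(n_arms)]
    if rng.random() < epsilon:
        a = int(rng.integers(n_arms))                # explore
    else:
        a = int(np.argmax(estimates))                # exploit
    r = true_theta[a] @ x + rng.normal(scale=0.1)    # reward depends on (x, a) only
    A[a] += np.outer(x, x)                           # update only the chosen arm
    b[a] += r * x
```
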
Questions tagged [contextual-bandits]
21 questions
8
votes
2 answers
What is the relation between the context in contextual bandits and the state in reinforcement learning?
Conceptually, in general, how is the context handled in contextual bandits (CB), compared to states in reinforcement learning (RL)?
Specifically, in RL, we can use a function approximator (e.g. a neural network) to generalize to other states.…

Maxim Volgin
- 183
- 2
- 8
5
votes
1 answer
Can you convert an MDP problem to a Contextual Multi-Arm Bandits problem?
I'm trying to get a better understanding of Multi-Arm Bandits, Contextual Multi-Arm Bandits and Markov Decision Process.
Basically, Multi-Arm Bandits is a special case of Contextual Multi-Arm Bandits where there is no state (features/context). And…

peidaqi
- 151
- 1
5
votes
2 answers
Are bandits considered an RL approach?
If a research paper uses multi-armed bandits (either in their standard or contextual form) to solve a particular task, can we say that they solved this task using a reinforcement learning approach? Or should we distinguish between the two and use…

user5093249
- 722
- 4
- 8
3
votes
1 answer
How to implement a contextual reinforcement learning model?
In a reinforcement learning model, states depend on the previously chosen actions. In the case where some of the states (but not all) are fully independent of the actions, yet still obviously determine the optimal actions, how could we take these…

freesoul
- 246
- 1
- 5
3
votes
1 answer
Can I apply DQN or policy gradient algorithms in the contextual bandit setting?
I have a problem which I believe can be described as a contextual bandit.
More specifically, in each round, I observe a context from the environment consisting of five continuous features, and, based on the context, I have to choose one of the ten…

gnikol
- 175
- 7
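One common reading of this setting is that each round is a one-step episode, so a policy-gradient update needs no bootstrapping at all. A rough sketch along those lines, taking only the five-feature/ten-arm numbers from the question and inventing everything else (the simulated environment, learning rate, and baseline):

```python
import numpy as np

# Each round is a one-step episode: observe a 5-dimensional context, pick one
# of 10 arms, receive a reward, and update a softmax policy with REINFORCE.
rng = np.random.default_rng(1)
n_features, n_arms, lr = 5, 10, 0.05
W = np.zeros((n_arms, n_features))                 # linear policy parameters
baseline = 0.0                                     # running average reward

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

true_theta = rng.normal(size=(n_arms, n_features))  # hypothetical environment

for t in range(5000):
    x = rng.normal(size=n_features)                # context from the environment
    probs = softmax(W @ x)
    a = rng.choice(n_arms, p=probs)
    r = true_theta[a] @ x + rng.normal(scale=0.1)  # bandit feedback for chosen arm only
    # gradient of log pi(a|x) w.r.t. W is (one_hot(a) - probs) outer x
    grad_log = (np.eye(n_arms)[a] - probs)[:, None] * x[None, :]
    W += lr * (r - baseline) * grad_log
    baseline += 0.01 * (r - baseline)              # simple variance-reducing baseline
```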
2
votes
0 answers
Is it better to model a Contextual Multi-Armed Bandit problem as an MDP with a non-zero discount factor than to treat it as it is?
I'd like to ask if it is, generally, better to model a problem that naturally appears as a Contextual Multi-Armed Bandit, like Recommender Systems, as a Markov Decision Process with a non-zero discount factor (otherwise it's just an MDP with one step…

Daviiid
- 563
- 3
- 15
2
votes
0 answers
Is there a UCB type algorithm for linear stochastic bandit with lasso regression?
Why is there no upper confidence bound algorithm for linear stochastic bandits that uses lasso regression in the case that the regression parameters are sparse in the features?
In particular, I don't understand what is hard about lasso regression…

PJORR
- 21
- 2
1
vote
1 answer
Why is it useful in some applications to use features that are shared by all arms?
In Li et al. (2010)'s highly cited paper, they talk about LinUCB with hybrid linear models in Section 3.2.
They motivate this by saying
In many applications including ours, it is helpful to use features that are shared by all arms, in addition to…

wwl
- 153
- 5
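For reference, the hybrid linear payoff model that Section 3.2 of Li et al. (2010) motivates with that sentence adds a coefficient vector shared by all arms on top of the per-arm coefficients:

$$\mathbb{E}\big[r_{t,a} \mid x_{t,a}, z_{t,a}\big] = z_{t,a}^{\top}\beta^{*} + x_{t,a}^{\top}\theta_{a}^{*},$$

where $z_{t,a}$ collects the features shared by all arms (weighted by the common $\beta^{*}$) and $x_{t,a}$ collects the arm-specific features (weighted by the per-arm $\theta_{a}^{*}$).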
1
vote
1 answer
How can I incorporate domain knowledge to choose actions in the case of large action spaces in multi-armed bandits?
Suppose one is using a multi-armed bandit, and one has relatively few "pulls" (i.e. timesteps) relative to the action set. For example, maybe there are 200 timesteps and 100 possible actions.
However, you do have information on how similar actions…

wwl
- 153
- 5
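A common way to use such similarity information, sketched below under the assumption that each action can be summarized by a feature vector, is to fit one shared linear reward model over action features (LinUCB-style), so every pull also informs the estimates of nearby actions. The 200 timesteps and 100 actions follow the question; the rest is illustrative:

```python
import numpy as np

# Share information across similar actions: describe each of the 100 actions
# by a feature vector and fit ONE linear reward model over those features,
# so a pull of one arm updates the estimate for every similar arm.
rng = np.random.default_rng(2)
n_actions, n_features, alpha = 100, 8, 1.0
action_features = rng.normal(size=(n_actions, n_features))   # domain knowledge
true_w = rng.normal(size=n_features)                          # hidden parameters

A = np.eye(n_features)           # regularized design matrix
b = np.zeros(n_features)

for t in range(200):             # few pulls relative to the number of actions
    A_inv = np.linalg.inv(A)
    w_hat = A_inv @ b
    # UCB score per action: predicted reward + exploration bonus
    ucb = action_features @ w_hat + alpha * np.sqrt(
        np.einsum("ij,jk,ik->i", action_features, A_inv, action_features))
    a = int(np.argmax(ucb))
    r = true_w @ action_features[a] + rng.normal(scale=0.1)
    A += np.outer(action_features[a], action_features[a])
    b += r * action_features[a]
```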
1
vote
0 answers
Name of a multiarmed bandit with only some levers available
In order to model a card game, as an exercise, I was thinking of an elementary setting as a multiarmed bandit, with each lever corresponding to the distribution of expected rewards of a specific card.
But, of course, the player only has some cards in the hand each…

arivero
- 51
- 7
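A rough sketch of one way to handle this, restricting a standard UCB1 index to whatever levers are available in the current round (the hand size and reward model below are made up):

```python
import numpy as np

# UCB1 restricted to the levers that are actually available this round
# (e.g. the cards currently in hand). Unavailable levers keep their
# statistics; they simply cannot be chosen this round.
rng = np.random.default_rng(3)
n_levers = 10
counts = np.zeros(n_levers)
means = np.zeros(n_levers)
true_means = rng.uniform(size=n_levers)

for t in range(1, 2001):
    available = rng.choice(n_levers, size=5, replace=False)   # "cards in hand"
    # UCB1 index; an available lever with no pulls yet gets priority
    ucb = np.where(counts > 0,
                   means + np.sqrt(2 * np.log(t) / np.maximum(counts, 1)),
                   np.inf)
    a = available[np.argmax(ucb[available])]
    r = float(rng.random() < true_means[a])                   # Bernoulli reward
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]
```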
1
vote
0 answers
Policy gradient (or, more generally, RL algorithms) for problems where actions do not determine the next state (the next state is independent of the action)
I am pretty new to RL. Could anyone suggest results/papers about whether or not policy gradient (or, more generally, RL algorithms) can be applied to problems where actions do not determine the next state? e.g. the next state is independent of the action…

Penn
- 11
- 2
1
vote
1 answer
How to handle delayed rewards in contextual bandits
All the examples I see in tf_Agents for contextual bandits involve a reward function that generates the reward instantly after an observation has been generated.
But in my real-world use case (say, sending emails and waiting for the click rate),…

tjt
- 111
- 3
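A library-agnostic sketch of one way to deal with this, not specific to tf_Agents: log the (context, action) decision at send time and only hand the triple to the bandit's update once the delayed feedback arrives (the delay length and click simulation below are invented):

```python
import random
from collections import deque

# Buffer (context, action) decisions at send time and train only when the
# delayed feedback (click / no click) eventually arrives.
pending = deque()          # decisions still waiting for feedback
dataset = []               # completed (context, action, reward) triples

def act(context):
    # placeholder policy: in a real system this would be the bandit's choice
    return random.randrange(10)

for t in range(1000):
    context = [random.random() for _ in range(5)]
    action = act(context)
    pending.append((t, context, action))

    # pretend feedback arrives with a fixed delay of 50 steps
    while pending and pending[0][0] <= t - 50:
        _, ctx, a = pending.popleft()
        reward = float(random.random() < 0.1)      # simulated click
        dataset.append((ctx, a, reward))
        # here you would call the bandit's update/train step with this triple
```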
1
vote
0 answers
Multi-armed Bandit in optimization of graph edge selection
I have the problem described below, and I wonder whether there is a class of multi-armed bandit approaches related to it.
I am working on computer networking optimization.
In the simplest scenario, we model the network as a graph with a…

Ramon
- 21
- 1
1
vote
1 answer
Why do I get bad results no matter which neural network function approximator I use in my parametrized Q-learning implementation for Contextual Bandits?
I'd like to ask why, no matter which neural network function approximator I use in my parametrized Q-learning implementation for a Contextual Bandits environment, I'm getting bad results. I don't know if it's a problem with my formulation of the problem…

Daviiid
- 563
- 3
- 15
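For comparison, a minimal shape that such a setup can take, sketched in PyTorch with invented dimensions: the network outputs one Q-value per arm, only the chosen arm's output gets a regression target (the immediate reward, since there is no bootstrapping), and exploration is epsilon-greedy:

```python
import torch
import torch.nn as nn

# Parametrized Q-learning for a contextual bandit: Q(context) outputs one
# value per arm; only the chosen arm's output receives a regression target.
n_features, n_arms, epsilon = 8, 4, 0.1
q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_arms))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

true_theta = torch.randn(n_arms, n_features)          # hypothetical environment

for t in range(2000):
    x = torch.randn(n_features)                       # context
    with torch.no_grad():
        q_values = q_net(x)
    if torch.rand(()) < epsilon:
        a = int(torch.randint(n_arms, ()))            # explore
    else:
        a = int(q_values.argmax())                    # exploit
    r = true_theta[a] @ x + 0.1 * torch.randn(())     # observed reward

    # regression target is the immediate reward (no bootstrapping, gamma = 0)
    loss = (q_net(x)[a] - r) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
```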
1
vote
0 answers
(explore-exploit + supervised learning) vs contextual bandits
Let's take an ad recommendation problem for one slot. The feedback is click/no click. I can solve this with contextual bandits, but I can also introduce exploration in supervised learning, where I learn my model from the collected data every k hours.
What can…

dksahuji
- 111
- 2
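For concreteness, the second alternative could look roughly like the sketch below: epsilon-greedy serving on top of a supervised click model that is refit from logged data on a schedule (the model, features, click simulation, and schedule are all illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Epsilon-greedy serving on top of a supervised click model that is refit
# from the logged (context, ad, click) data every `refit_every` rounds.
rng = np.random.default_rng(4)
n_ads, n_features, epsilon, refit_every = 5, 6, 0.1, 500
log_X, log_y = [], []            # logged (context + ad one-hot) -> click
model = None

def featurize(context, ad):
    one_hot = np.zeros(n_ads)
    one_hot[ad] = 1.0
    return np.concatenate([context, one_hot])

true_w = rng.normal(size=n_features + n_ads)          # hidden click model

for t in range(5000):
    context = rng.normal(size=n_features)
    if model is None or rng.random() < epsilon:
        ad = int(rng.integers(n_ads))                 # exploration / cold start
    else:
        scores = [model.predict_proba(featurize(context, a)[None, :])[0, 1]
                  for a in range(n_ads)]
        ad = int(np.argmax(scores))
    click = int(rng.random() < 1 / (1 + np.exp(-true_w @ featurize(context, ad))))
    log_X.append(featurize(context, ad))
    log_y.append(click)
    if (t + 1) % refit_every == 0 and len(set(log_y)) > 1:
        model = LogisticRegression(max_iter=200).fit(np.array(log_X), np.array(log_y))
```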