
I have a problem which I believe can be described as a contextual bandit.

More specifically, in each round, I observe a context from the environment consisting of five continuous features, and, based on the context, I have to choose one of the ten available actions. The actions do not influence the next context.

Based on the above I have the following questions:

  1. Is this a contextual bandit or an MDP with a discount equal to zero (one-step RL)? I have read that, in contextual bandits, we receive a different context for each action, and I am a little bit confused.

  2. Can I use the DQN algorithm with a TD target that is only the observed reward, instead of the reward plus the predicted value of the next state?

  3. Can I use a policy gradient algorithm, like REINFORCE or A2C? If yes, should I use a baseline, and what should this baseline be?

  4. I have seen in the literature that there are some algorithms for contextual bandits, such as LinUCB, LinRel, NeuralBandit, etc., and I am wondering why the DQN, A2C and REINFORCE algorithms, which seem to work well in the MDP setting, are not used in contextual bandits, given that this problem can be described as an MDP with a discount equal to zero.

gnikol
  • The comments have been [moved to chat](https://chat.stackexchange.com/rooms/109002/discussion-on-question-by-gnikol-can-i-apply-dqn-or-policy-gradient-algorithms-i). – nbro Jun 06 '20 at 21:55

1 Answer


MDPs are strict generalisations of contextual bandits, adding time steps and state transitions, plus the concept of return as a measure of agent performance.

Therefore, methods used in RL to solve MDPs will work to solve contextual bandits. You can either treat a contextual bandit as a series of 1-step episodes (with the start state chosen at random), or as a continuing problem with a discount factor of zero.
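For concreteness, here is a minimal sketch of the 1-step-episode view for the setting in the question (5 continuous context features, 10 actions). The linear reward model and the random "policy" are made-up placeholders, purely to show the interaction loop:

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES, N_ACTIONS = 5, 10                       # matches the setting in the question
W_true = rng.normal(size=(N_ACTIONS, N_FEATURES))   # hypothetical reward model, for illustration only

def sample_context():
    """The environment draws a context independently of past actions."""
    return rng.normal(size=N_FEATURES)

def sample_reward(context, action):
    """Hypothetical noisy linear reward, just to make the loop runnable."""
    return W_true[action] @ context + rng.normal(scale=0.1)

# Treat the bandit as a stream of 1-step episodes: the "episode" ends
# (done = True) immediately after the single action, so there is no
# next state and the discount factor never matters.
for episode in range(3):
    s = sample_context()                 # reset: initial state = observed context
    a = rng.integers(N_ACTIONS)          # placeholder policy (uniform random)
    r = sample_reward(s, a)
    done = True                          # episode ends here
    print(episode, a, round(float(r), 3))
```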

Can I use the DQN algorithm with a TD target that is only the observed reward, instead of the reward plus the predicted value of the next state?

Yes. That's identical mathematically to having a discount of zero, or having 1-step episodes.
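As a sketch of what that looks like in code, here is a single DQN-style update where the TD target is just the observed reward. The PyTorch network shape, optimiser settings and batch contents are illustrative placeholders, not anything prescribed by the question:

```python
import torch
import torch.nn as nn

N_FEATURES, N_ACTIONS = 5, 10

# Hypothetical Q-network: maps a context to one estimated value per action.
q_net = nn.Sequential(nn.Linear(N_FEATURES, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(contexts, actions, rewards):
    """One gradient step on a batch of (context, action, reward) samples.

    With 1-step episodes (equivalently, gamma = 0) the TD target is just
    the observed reward, so there is no bootstrapping and no need for a
    target network.
    """
    q_all = q_net(contexts)                                   # shape (B, N_ACTIONS)
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    target = rewards                                          # instead of r + gamma * max_a' Q(s', a')
    loss = nn.functional.mse_loss(q_taken, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example call with random placeholder data:
B = 32
dqn_update(torch.randn(B, N_FEATURES),
           torch.randint(0, N_ACTIONS, (B,)),
           torch.randn(B))
```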

Can I use a policy gradient algorithm, like REINFORCE or A2C? If yes, should I use a baseline and what this baseline should be?

Yes. Once the problem is converted to an MDP, you can use the same baselines in these algorithms as normal (A2C's use of the advantage instead of the action value is already a form of baseline). In general, adding a baseline helps reduce variance, so it may still help when applying RL to contextual bandit problems.
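Here is a minimal sketch of such an update for 1-step episodes, assuming a PyTorch policy network with a learned state-value head acting as the baseline (all names, sizes and hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

N_FEATURES, N_ACTIONS = 5, 10

# Hypothetical policy network plus a learned state-value baseline,
# as in A2C where the advantage r - V(s) replaces the raw return.
policy = nn.Sequential(nn.Linear(N_FEATURES, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
value = nn.Sequential(nn.Linear(N_FEATURES, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(policy.parameters()) + list(value.parameters()), lr=1e-3)

def pg_update(contexts, actions, rewards):
    """REINFORCE-with-baseline step for 1-step episodes.

    The return of an episode is just the immediate reward, so the
    advantage is r - V(s); the baseline only reduces variance and does
    not bias the policy gradient.
    """
    logits = policy(contexts)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    baseline = value(contexts).squeeze(1)
    advantage = rewards - baseline
    policy_loss = -(advantage.detach() * log_probs).mean()
    value_loss = advantage.pow(2).mean()        # fit V(s) to the observed reward
    loss = policy_loss + 0.5 * value_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example call with random placeholder data:
B = 32
pg_update(torch.randn(B, N_FEATURES),
          torch.randint(0, N_ACTIONS, (B,)),
          torch.randn(B))
```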

I have seen in the literature that there are some algorithms for contextual bandits, such as LinUCB, LinRel, NeuralBandit, etc., and I am wondering why the DQN, A2C and REINFORCE algorithms, which seem to work well in the MDP setting, are not used in contextual bandits

There are a couple of reasons that contextual bandit problems are not solved using RL techniques more often:

  • The goal in contextual bandits is commonly to build a highly efficient online learner that minimises regret. Regret is the long-term difference in total reward between a policy that always exploits the best action choice and the policy actually followed, which has to explore in order to find that best choice. Some RL solvers, e.g. DQN, are poor by this metric (a short sketch of the regret definition follows this list).

  • Because there are no timesteps and no state transitions, algorithm designers can exploit that structure to make learning more efficient.

  • Improvements to RL methods designed to help with sparse rewards and the credit assignment problem in MDPs are pointless for contextual bandits, and may be wasteful or even counter-productive.
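As a concrete reading of the regret definition mentioned above, here is a small sketch. Note that the per-action expected rewards would not normally be known to the learner; they are passed in here only to illustrate the quantity being measured:

```python
import numpy as np

def cumulative_regret(expected_rewards, chosen_actions):
    """Regret after T rounds: the sum over rounds of the gap between the
    best action's expected reward for that round's context and the
    expected reward of the action actually chosen.

    expected_rewards: array of shape (T, n_actions), mean reward of each
                      action given that round's context (usually unknown).
    chosen_actions:   array of shape (T,), the actions actually taken.
    """
    best = expected_rewards.max(axis=1)
    taken = expected_rewards[np.arange(len(chosen_actions)), chosen_actions]
    return float(np.sum(best - taken))

# Toy example: 3 rounds, 2 actions.
mu = np.array([[1.0, 0.5], [0.2, 0.9], [0.4, 0.4]])
print(round(cumulative_regret(mu, np.array([0, 0, 1])), 3))  # 0.0 + 0.7 + 0.0 = 0.7
```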

Some RL algorithms reduce to something nearly identical to their contextual bandit counterparts, and have the same performance characteristics; e.g. REINFORCE with a baseline, applied to 1-step episodes, is essentially the Contextual Gradient Bandit algorithm.
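For illustration, here is a sketch of that correspondence, assuming linear softmax preferences over the context and a running mean reward as the baseline. The reward weights W are a made-up stand-in for the environment:

```python
import numpy as np

N_FEATURES, N_ACTIONS = 5, 10
rng = np.random.default_rng(1)

# Hypothetical linear "preferences" H(s, a) = theta[a] @ s with a softmax policy.
theta = np.zeros((N_ACTIONS, N_FEATURES))
W = rng.normal(size=(N_ACTIONS, N_FEATURES))   # hypothetical true reward weights
alpha, avg_reward = 0.1, 0.0

def softmax(h):
    z = np.exp(h - h.max())
    return z / z.sum()

for t in range(1, 1001):
    s = rng.normal(size=N_FEATURES)            # context, independent of past actions
    pi = softmax(theta @ s)
    a = rng.choice(N_ACTIONS, p=pi)
    r = W[a] @ s + rng.normal(scale=0.1)
    avg_reward += (r - avg_reward) / t         # running mean reward as the baseline
    # This is the same update REINFORCE-with-baseline gives for a
    # linear-softmax policy on a 1-step episode:
    # grad log pi(a|s) = (one_hot(a) - pi) outer s.
    theta += alpha * (r - avg_reward) * np.outer(np.eye(N_ACTIONS)[a] - pi, s)
```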

It is also worth noting that many problem domains where contextual bandit algorithms do well, e.g. website recommendations and advertising, have research showing that a more sophisticated MDP model and an RL-like approach can do even better. Although that is not quite the same as your question, it typically means extending the model so that timesteps and state transitions are meaningful.

Neil Slater
  • Thank you very much for your answer. 1. Can you please share a link that describes the Contextual Gradient Bandit algorithm? 2. I believe that the regret performance of DQN has to do with the exploration strategy. If we use a more sophisticated exploration strategy (for example https://arxiv.org/pdf/1606.01868.pdf), the performance will be better. – gnikol Jun 06 '20 at 21:06
  • Thank you Neil for the link to the Contextual Gradient Bandit algorithm. Do you have a link where a counterpart of DQN is applied in the contextual bandit setting? Finally, I am struggling to understand what the baseline should be. For example, in REINFORCE the baseline is the average return of m trajectories obtained by following the policy. By analogy, in the contextual bandit setting, the baseline might be just the average reward of the policy. Correct? – gnikol Jun 06 '20 at 22:04
  • @gnikol No, I don't have such a link, sorry. Once you convert the contextual bandit setting to the MDP setting, there is no difference from a DQN used normally, so I am not sure what you are hoping to see. The same answer applies for REINFORCE: convert the CB to an MDP and use the same baseline (the return), and yes, that is the same as the average reward. But if you are spending all your time trying to adapt the algorithm to something that is not quite an MDP, you'd be better off just using a contextual bandit solver instead anyway. – Neil Slater Jun 06 '20 at 22:07
  • Thank you very much. – gnikol Jun 06 '20 at 22:13