
Multi-Armed Bandits (MABs) are a broad field of research with several distinct streams. In addition to the common objective of maximizing the cumulative reward, there are also so-called Best-Arm (identification) Bandits; cf. Lattimore and Szepesvári (2020), Chapter 33, or Audibert et al. (2010). The goal of the latter algorithms is only to identify the best arm (consequently, they aim to minimize the simple regret instead of the cumulative regret).
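To make the distinction concrete (this is my paraphrase of the standard definitions, in generic notation rather than that of any specific reference): with arm means $\mu_a$, optimal mean $\mu^* = \max_a \mu_a$, pulled arms $a_1, \dots, a_n$, and a final recommendation $\hat{a}_n$,

$$R_n = n\,\mu^* - \mathbb{E}\left[\sum_{t=1}^{n} \mu_{a_t}\right] \quad \text{(cumulative regret)}, \qquad r_n = \mu^* - \mathbb{E}\left[\mu_{\hat{a}_n}\right] \quad \text{(simple regret)}.$$

So a cumulative-regret algorithm is penalized for every suboptimal pull during learning, whereas a Best-Arm Bandit is judged only on the quality of its final recommendation.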

Since a Best-Arm Bandit also involves a stochastic environment and receives rewards by taking actions, do Best-Arm Bandits belong to the domain of reinforcement learning?

I came across this related post (which, however, refers to Multi-Armed Bandits in terms of the cumulative regret). From the responses in that post, my understanding is that the notion of "purely evaluative feedback" should also apply to Best-Arm Bandits.

My inclination would be to categorize Best-Arm Bandits as Reinforcement Learning, not only because of the presence of evaluative feedback but also because they constitute the simplest form of Reinforcement Learning problem of all: a Markov Decision Process with only a single state $s$, a discrete set of actions $a \in A$, and a reward function $r(s,a)$, with the associated goal of maximizing the average reward. For an infinite horizon, this should be the same goal as identifying the best arm.
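To illustrate what I mean, here is a minimal sketch (my own toy example, not taken from the cited references; the Gaussian rewards, the arm means, and the budget are made-up assumptions) of fixed-budget best-arm identification on such a one-state MDP:

```python
import numpy as np

# Toy best-arm identification under a fixed budget: the "MDP" has a
# single state, so a policy is just a rule for choosing the next arm.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.4])  # hypothetical r(s, a) for the single state s
budget = 3000                           # total pulls allowed

n_arms = len(true_means)
counts = np.zeros(n_arms)
sums = np.zeros(n_arms)

for t in range(budget):
    a = t % n_arms                               # uniform (round-robin) allocation
    reward = rng.normal(true_means[a], 1.0)      # stochastic, purely evaluative feedback
    counts[a] += 1
    sums[a] += reward

best_arm = int(np.argmax(sums / counts))         # recommendation once the budget is spent
simple_regret = true_means.max() - true_means[best_arm]
print(best_arm, simple_regret)
```

Under the cumulative-regret objective, the round-robin allocation above would be wasteful, but for the simple regret only the final recommendation matters, which seems to match the single-state, evaluative-feedback picture I describe.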

D. B.
  • Can you provide a reference that describes these "Best-Arm (identification) Bandits" (so that we all have your context)? – nbro Feb 17 '22 at 14:20
  • I added two references regarding Best-Arm Bandits, thank you for pointing it out. – D. B. Feb 17 '22 at 15:34
  • The pure goal of identifying the "best arm" without other constraints is a pretty dull problem. Pull each arm millions of times, calculate the mean reward, the highest mean is the estimate for the "best arm". There must be some constraint or complication to make the subject noteworthy enough to be named. With MABs it is minimising regret whilst learning. For BABs, there must be something else - is it something you understand from the papers and can summarise in the question? – Neil Slater Feb 17 '22 at 15:46
  • For example, are BABs rated on accuracy to identify the "best" vs number of learning steps so far, thus constrained by time and/or confidence measures? – Neil Slater Feb 17 '22 at 15:50
  • No, this is by no means a dull problem. Not only does the field of Best-Arm Bandits focus on this type of problem, but so do other popular research streams such as "Ranking and Selection". In practice, it is not possible to pull each arm, for example, a million times. This would be way too computationally expensive (in particular when there is a huge number of arms). Instead, a certain computational budget (of iterations or simulations) is often assumed, within which the best arm needs to be identified. – D. B. Feb 17 '22 at 18:13
  • Bandit problems are definitely a type of reinforcement learning, but I feel it would be strange to call an algorithm which solves a bandit problem a "reinforcement learning algorithm", for the same reason it would be strange to call a square a rectangle. – Taw Feb 17 '22 at 22:39
  • Was about to answer, but then realised you don't give any motivation for categorising BABs as reinforcement learning. Other than that it is clearly a related subject, what is your purpose for making a categorisation? Is it so you can file your notes on the subject under a particular heading? Do you wish to use a particular result or theory from RL in order to solve a BAB problem? – Neil Slater Feb 18 '22 at 08:01

0 Answers