Multi-Armed Bandits (MABs) are a broad field of research with several distinct streams. In addition to the common objective of maximizing the cumulative reward, there are also so-called Best-Arm (Identification) Bandits, cf. Lattimore and Szepesvári (2020), Chapter 33, or Audibert et al. (2010). The goal of the latter algorithms is solely to identify the best arm (consequently, they aim to minimize the simple regret instead of the cumulative regret).
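To make the distinction concrete, here is a minimal sketch in Python of the best-arm setting (the Gaussian arms, the fixed budget, and the uniform round-robin strategy are my own simplifying assumptions, not the algorithms from the cited references): only the final recommendation is scored via the simple regret, whereas a cumulative-regret bandit would be penalized for every exploratory pull along the way.

```python
import numpy as np

rng = np.random.default_rng(0)

def pull(arm_means, a):
    """Sample a noisy reward from arm a (unit-variance Gaussian assumed)."""
    return rng.normal(arm_means[a], 1.0)

def best_arm_identification(arm_means, budget):
    """Uniform exploration: spend the whole budget sampling arms in
    round-robin order, then recommend the arm with the highest
    empirical mean. Quality is measured by the simple regret
    mu_star - mu_recommended, not by rewards collected on the way."""
    k = len(arm_means)
    sums = np.zeros(k)
    counts = np.zeros(k)
    for t in range(budget):
        a = t % k                       # round-robin over all arms
        sums[a] += pull(arm_means, a)
        counts[a] += 1
    recommended = int(np.argmax(sums / counts))
    simple_regret = max(arm_means) - arm_means[recommended]
    return recommended, simple_regret

means = [0.2, 0.5, 0.9]
arm, sr = best_arm_identification(means, budget=300)
print(f"recommended arm {arm}, simple regret {sr:.3f}")
```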
Since a Best-Arm Bandit also involves a stochastic environment and receives rewards by taking actions, I wonder whether Best-Arm Bandits belong to the domain of reinforcement learning.
I came across this related post (which, however, refers to Multi-Armed Bandits in terms of the cumulative regret). From the responses there, my understanding is that the notion of "purely evaluative feedback" should also apply to Best-Arm Bandits.
My inclination would be to categorize Best-Arm Bandits as Reinforcement Learning, not only because of the presence of evaluative feedback but also because they constitute the simplest form of all Reinforcement Learning problems: a Markov Decision Process with only a single state $s$, a discrete set of actions $a \in A$, and a reward function $r(s,a)$, with the associated goal of maximizing the average reward. Over an infinite horizon, this should amount to the same goal as identifying the best arm.
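To make that last step explicit (writing $\mu_a = \mathbb{E}[r(s,a)]$ for the mean reward of arm $a$, notation I am introducing here): for any policy that eventually commits to some arm $a'$, the law of large numbers gives

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r_t = \mu_{a'} \quad \text{almost surely},$$

so the long-run average reward is maximized exactly when $a' = a^* = \arg\max_{a \in A} \mu_a$, i.e. when the policy settles on the best arm.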