
My question concerns stochastic combinatorial multi-armed bandits, and more specifically the algorithm CombUCB1 presented in this paper, which is a UCB-like algorithm.

Essentially, in each round of the sequential game the learner chooses a super-arm to play. A super-arm is a $d$-dimensional vector $a \in \mathcal{A}$, where $\mathcal{A} \subset \{0,1\}^d$. When the $i$-th element of a super-arm $a$ equals $1$ ($a(i) = 1$, $i \in \{1, \dots, d\}$), basic action $i$ is active. In each round the learner plays the basic actions that are active in the chosen super-arm. The rewards of the basic actions are stochastic, and a super-arm receives as its reward the sum of the rewards of its active basic actions.
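
To make the setting concrete, here is a minimal toy sketch of how I picture the semi-bandit feedback (my own code, not from the paper; the Bernoulli rewards and the means are just an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5                                       # number of basic actions
mu = np.array([0.2, 0.5, 0.7, 0.3, 0.9])    # hypothetical (unknown) mean rewards

# A super-arm: a 0/1 vector marking which basic actions are active.
a = np.array([1, 0, 1, 0, 1])

# One round of semi-bandit feedback: each active basic action draws a
# stochastic reward, and the super-arm's reward is the sum of those draws.
basic_rewards = rng.binomial(1, mu)         # Bernoulli draws (an assumption)
round_reward = int(a @ basic_rewards)       # sum over active basic actions
```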

The paper above presents a UCB-like algorithm in which each basic action is associated with a UCB index, and in each round the learner plays the super-arm that maximises the sum of the indices of its active basic actions. My question concerns the confidence interval around the mean reward of a basic action, presented in equation $(2)$ of the paper. There, the exploration bonus is

$c_{t,s} = \sqrt{\frac{1.5 \log t}{s}}$
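
To make sure I'm reading the algorithm correctly, here is a rough sketch of how I understand the index computation and the super-arm selection (my own code and variable names, not the paper's; `means` and `counts` track the empirical means and observation counts of the basic actions, and I brute-force over a small $\mathcal{A}$ purely for illustration):

```python
import numpy as np

def ucb_indices(means, counts, t):
    """Empirical mean of each basic action plus the exploration bonus
    c_{t,s} = sqrt(1.5 * log(t) / s), where s is that action's count."""
    return means + np.sqrt(1.5 * np.log(t) / counts)

def choose_super_arm(super_arms, means, counts, t):
    """Play the super-arm maximising the sum of the UCB indices of its
    active basic actions (brute force over the rows of `super_arms`,
    just for illustration)."""
    u = ucb_indices(means, counts, t)
    scores = super_arms @ u        # one score per candidate super-arm
    return super_arms[np.argmax(scores)]
```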

I don't understand where that $1.5$ comes from. I've always known that one uses the Chernoff-Hoeffding inequality to derive the exploration bonus in a UCB algorithm. Am I wrong, or does it need to be computed in some other way? I've always seen the same bonus but with a $2$ instead of the $1.5$ (reference). Could someone please explain where the $1.5$ comes from?
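
For context, here is the calculation I had in mind: assuming the rewards of the basic actions lie in $[0,1]$, Hoeffding's inequality for the empirical mean $\hat{\mu}_s$ after $s$ observations gives

$$\Pr\big(|\hat{\mu}_s - \mu| \ge c_{t,s}\big) \le 2\exp\!\big(-2 s\, c_{t,s}^2\big) = \begin{cases} 2t^{-3} & \text{if } c_{t,s} = \sqrt{1.5 \log t / s},\\ 2t^{-4} & \text{if } c_{t,s} = \sqrt{2 \log t / s}, \end{cases}$$

so the $1.5$ corresponds to a confidence level of order $t^{-3}$ rather than the $t^{-4}$ of the classical UCB1 bonus, and what I don't see is why that weaker level suffices for the regret analysis here.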

I know there is a similar question here, but I cannot really see how the argument there applies to this case.

Thank you in advance if you have time to read and answer my question.

Adam
  • The 1.5 seems to arise from the 6, which in turn seems to arise from the 48. In my experience these proofs often use choices of constants that are clear only in hindsight. So I think if you replace all the constants with variables $k_1, k_2, k_3$, you might be able to obtain a general expression which, when optimised, gives these values. –  Feb 04 '21 at 10:40
  • Do you mean the $6$ from eq. 6 and the $48$ from Theorem 2 in (Kveton et al., 2015)? So you're saying that if I replace those constants with variables $k_1, k_2, k_3$ and optimise, I should be able to recover the $1.5$. Have I understood correctly? Thank you for your answer, I really appreciate it! – Adam Feb 05 '21 at 09:28

0 Answers