
My question concerns stochastic combinatorial multi-armed bandits, and more specifically the algorithm CombUCB1 presented in this paper, which is a UCB-like algorithm.

Essentially, in each round of the sequential game the learner chooses a super-arm to play. A super-arm is a $d$-dimensional vector $a \in \mathcal{A}$, where $\mathcal{A} \subset \{0,1\}^d$. When the $i$-th element of a super-arm $a$ equals $1$ ($a(i) = 1$, $i \in \{1, \dots, d\}$), basic action $i$ is active. In each round the learner plays the basic actions that are active in the chosen super-arm. The rewards of the basic actions are stochastic, and a super-arm receives as its reward the sum of the rewards of its active basic actions.
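
To make the setting concrete, here is a minimal toy sketch of how I picture the semi-bandit feedback (my own code, not from the paper; the Bernoulli rewards and the means are just an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5                                       # number of basic actions
mu = np.array([0.2, 0.5, 0.7, 0.3, 0.9])    # hypothetical (unknown) mean rewards

# A super-arm: a 0/1 vector marking which basic actions are active.
a = np.array([1, 0, 1, 0, 1])

# One round of semi-bandit feedback: each active basic action draws a
# stochastic reward, and the super-arm's reward is the sum of those draws.
basic_rewards = rng.binomial(1, mu)         # Bernoulli draws (an assumption)
round_reward = int(a @ basic_rewards)       # sum over active basic actions
```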

The paper above presents a UCB-like algorithm in which each basic action is associated with a UCB index, and in each round the learner plays the super-arm that maximises the sum of the indices of its active basic actions. My question concerns the confidence interval around the mean reward of a basic action, presented in equation $(2)$ of the paper. There, the exploration bonus is

$c_{t,s} = \sqrt{\frac{1.5 \log t}{s}}$
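
To make sure I'm reading the algorithm correctly, here is a rough sketch of how I understand the index computation and the super-arm selection (my own code and variable names, not the paper's; `means` and `counts` track the empirical means and observation counts of the basic actions, and I brute-force over a small $\mathcal{A}$ purely for illustration):

```python
import numpy as np

def ucb_indices(means, counts, t):
    """Empirical mean of each basic action plus the exploration bonus
    c_{t,s} = sqrt(1.5 * log(t) / s), where s is that action's count."""
    return means + np.sqrt(1.5 * np.log(t) / counts)

def choose_super_arm(super_arms, means, counts, t):
    """Play the super-arm maximising the sum of the UCB indices of its
    active basic actions (brute force over the rows of `super_arms`,
    just for illustration)."""
    u = ucb_indices(means, counts, t)
    scores = super_arms @ u        # one score per candidate super-arm
    return super_arms[np.argmax(scores)]
```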

I don't understand where that $1.5$ comes from. I've always known that one uses the Chernoff-Hoeffding inequality to derive the exploration bonus in a UCB algorithm. Am I wrong, or does it need to be computed in some other way? I've always seen the same bonus but with a $2$ instead of the $1.5$ (reference). Could someone please explain where the $1.5$ comes from?
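
For context, here is the calculation I had in mind: assuming the rewards of the basic actions lie in $[0,1]$, Hoeffding's inequality for the empirical mean $\hat{\mu}_s$ after $s$ observations gives

$$\Pr\big(|\hat{\mu}_s - \mu| \ge c_{t,s}\big) \le 2\exp\!\big(-2 s\, c_{t,s}^2\big) = \begin{cases} 2t^{-3} & \text{if } c_{t,s} = \sqrt{1.5 \log t / s},\\ 2t^{-4} & \text{if } c_{t,s} = \sqrt{2 \log t / s}, \end{cases}$$

so the $1.5$ corresponds to a confidence level of order $t^{-3}$ rather than the $t^{-4}$ of the classical UCB1 bonus, and what I don't see is why that weaker level suffices for the regret analysis here.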

I know there is a similar question here, but I cannot really see how the argument there applies to this case.

Thank you in advance if you have time to read and answer my question.

Adam
  • The 1.5 seems to arise from the 6, which in turn seems to arise from the 48. In my experience these proofs often use choices of constants that are clear only in hindsight. So I think if you replace all the constants with variables $k_1, k_2, k_3$, you might be able to obtain a general expression which, when optimised, gives these values. –  Feb 04 '21 at 10:40
  • Do you mean the $6$ from eq. 6 and the $48$ from Theorem 2 in (Kveton et al., 2015)? So you're saying that if I replace those constants with variables $k_1, k_2, k_3$ and optimise, I should be able to recover the $1.5$. Have I understood correctly? Thank you for your answer, I really appreciate it! – Adam Feb 05 '21 at 09:28

0 Answers