I notice the following behavior when running experiments with $\epsilon$-greedy and UCB1. If the reward is kept binary (0 or 1), both algorithms perform on par with each other. However, if I make the reward continuous (and bounded in [0, 1]), $\epsilon$-greedy remains good but UCB1's performance plummets. As an experiment, I simply scaled the reward of 1 down by a factor of 1/10, and that alone is enough to hurt UCB1's performance.
I have plotted the reward values estimated by the algorithms and see that (due to the confidence interval term) UCB1 largely overestimates the rewards.
Is there a practical trick to fix this? My guess is that the scaling coefficient $c$ in front of the upper-confidence-bound term is meant for exactly this case. Nevertheless, the difference in performance is staggering to me. How do I know when a scaling coefficient is needed, and what value is appropriate?
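For concreteness, this is the selection rule I have in mind: a minimal sketch assuming the textbook UCB1 index $\bar{x}_i + c\sqrt{2\ln t / n_i}$ (the function name and signature are mine, not from any particular library):

```python
import math

def ucb1_choose(counts, values, t, c=1.0):
    """Pick an arm with the UCB1 index; c scales the exploration bonus.

    counts[i]: number of pulls of arm i so far
    values[i]: running mean reward of arm i
    t: total number of pulls so far
    """
    # Play every arm once before the index is well defined.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # Index: mean estimate plus c * sqrt(2 ln t / n_i).
    return max(range(len(counts)),
               key=lambda i: values[i] + c * math.sqrt(2.0 * math.log(t) / counts[i]))
```

If I understand the formula correctly, with rewards squeezed into [0, 0.1] the bonus term dwarfs the 0.1 gap between arms unless $c$ is shrunk by the same factor, which would explain the overestimation I see in the plots.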
===
update 1 The reward distribution is very simple. There are 17 arms; arms 3 and 4 give the learning algorithm a reward of 1, and the other arms return a reward of 0. There is no stochasticity, and the algorithm runs for 1000 iterations.
If I scale the reward by a factor of 1/10, for instance, then UCB1 takes a very long time to start catching up with $\epsilon$-greedy.
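In case it helps, here is a minimal self-contained sketch of my setup (assumptions on my side: sample-mean estimates and the textbook UCB1 index with $c = 1$; setting SCALE = 1.0 recovers the binary case):

```python
import math

N_ARMS, HORIZON, SCALE = 17, 1000, 0.1  # SCALE = 1.0 reproduces the binary case

def reward(arm):
    # Deterministic rewards: arms 3 and 4 pay 1 (times SCALE), the rest pay 0.
    return SCALE if arm in (3, 4) else 0.0

counts = [0] * N_ARMS      # pulls per arm
values = [0.0] * N_ARMS    # running mean reward per arm
collected = 0.0
c = 1.0                    # exploration coefficient

for t in range(1, HORIZON + 1):
    untried = [i for i in range(N_ARMS) if counts[i] == 0]
    if untried:
        arm = untried[0]   # initialise every arm once
    else:
        arm = max(range(N_ARMS),
                  key=lambda i: values[i] + c * math.sqrt(2.0 * math.log(t) / counts[i]))
    r = reward(arm)
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]  # incremental mean update
    collected += r

print("fraction of optimal reward:", collected / (HORIZON * SCALE))
```

With SCALE = 0.1 the exploration bonus starts near $\sqrt{2\ln t} \approx 3.7$ per arm, two orders of magnitude above the 0.1 reward gap, so the algorithm keeps cycling through all 17 arms for a long stretch of the 1000-step horizon.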