
I notice the following behavior when running experiments with $\epsilon$-greedy and UCB1. If the reward is kept binary (0 or 1), both algorithms perform on par with each other. However, if I make the reward continuous (and bounded in [0, 1]), then $\epsilon$-greedy remains good but UCB1's performance plummets. As an experiment, I simply scaled the reward of 1 by a factor of 1/10, which negatively affects performance.

I have plotted the reward values estimated by the algorithms and see that (due to the confidence interval term) UCB1 largely overestimates the rewards.

Is there a practical trick to fix that? My guess is that the scaling coefficient $c$ in front of the upper confidence bound is meant precisely for this case. Nevertheless, the size of the difference in performance is staggering to me. How do I know when a scaling coefficient is needed, and what value will be appropriate?

===

update 1 The reward distribution is very simple. There are 17 arms; for arms 3 and 4 the learning algorithm gets a reward of 1, while the other arms return a reward of 0. There is no stochasticity, and the algorithm runs for 1000 iterations.

If I scale the reward by a factor of 1/10, for instance, then UCB1 takes a very long time to start catching up with $\epsilon$-greedy.
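A minimal sketch of this setup (assuming standard textbook implementations of both algorithms; the epsilon, c, and SCALE values here are illustrative choices, not my exact experiment code):

```python
import numpy as np

# Sketch of the setup above: 17 arms, arms 3 and 4 pay 1, the rest pay 0,
# no stochasticity, 1000 iterations. Epsilon, c and SCALE are illustrative.
N_ARMS, STEPS, SCALE = 17, 1000, 0.1      # set SCALE = 1.0 for binary rewards
true_rewards = np.zeros(N_ARMS)
true_rewards[[3, 4]] = 1.0

def run_eps_greedy(eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q, N, total = np.zeros(N_ARMS), np.zeros(N_ARMS), 0.0
    for _ in range(STEPS):
        a = rng.integers(N_ARMS) if rng.random() < eps else int(np.argmax(Q))
        r = true_rewards[a] * SCALE
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]          # incremental sample mean
        total += r
    return total

def run_ucb1(c=1.0):
    Q, N, total = np.zeros(N_ARMS), np.zeros(N_ARMS), 0.0
    for t in range(1, STEPS + 1):
        if np.any(N == 0):                 # play every arm once first
            a = int(np.argmin(N))
        else:
            a = int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
        r = true_rewards[a] * SCALE
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]
        total += r
    return total

print("eps-greedy total reward:", run_eps_greedy())
print("UCB1       total reward:", run_ucb1())
```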

– d56
  • Can you specify the continuous reward distribution? Not very clear what you mean by "common value". – Kostya Apr 23 '21 at 16:23
  • Extended the description – d56 Apr 23 '21 at 16:38
  • UCB1 is known to scale poorly with the number of arms. If the true expected rewards are close (i.e. 0 vs 0.1), then UCB1 will take longer with a lot of arms than it would with two rewards that are more different (i.e. 0 and 1). – David Apr 24 '21 at 09:10
  • @DavidIreland as far as I remember in the LinUCB paper they had quite a lot of arms (articles to recommend). Maybe it worked as they had so much data too? – d56 Apr 26 '21 at 08:55
  • hmm, I am not familiar with LinUCB so it may well be that there are new methods available that can scale well with arms. I am only really familiar with UCB1 and it has been a _long_ time since I looked at it, hence why I can't answer your question. – David Apr 26 '21 at 09:44

1 Answer


$\epsilon$-greedy is unaffected by scaling of the rewards: with probability $\epsilon$ it selects a random action, and otherwise it picks the arm with the highest estimated mean reward, a choice that does not change when all rewards are multiplied by the same positive constant.
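As a quick illustrative check (the value estimates below are made-up numbers), the greedy choice is the same before and after scaling:

```python
import numpy as np

# Made-up value estimates; the greedy argmax is unchanged by a positive scaling.
Q = np.array([0.0, 0.02, 0.10, 0.05])
print(np.argmax(Q), np.argmax(Q * 0.1))   # -> 2 2
```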

On the other hand, if we look at the formulation of UCB (Section 2.7 of Reinforcement Learning, Sutton and Barto):

$$A_t \doteq \underset{a}{\operatorname{argmax}} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where $Q_t(a) = \frac{R_1 + R_2 + \dots + R_{N_t(a)}}{N_t(a)}$ is the average of the rewards received so far for that action and $N_t(a)$ is the number of times it has been selected. By scaling the rewards, for example by 1/10, you are essentially scaling $Q_t(a)$ by 1/10, which shifts UCB's behaviour towards the exploratory right-hand term $c\sqrt{\frac{\ln t}{N_t(a)}}$. If you want to offset your scaling of the rewards, you should scale $c$ by the same amount. Theoretically, I believe this should result in the same performance as that achieved pre-scaling.
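To illustrate the point numerically (the estimates and visit counts below are arbitrary, not taken from the question's experiment), scaling only the reward estimates can flip the argmax towards a less-visited arm, while scaling $c$ by the same factor restores the original choice:

```python
import numpy as np

# Arbitrary value estimates and visit counts, purely for illustration.
Q = np.array([0.9, 0.5, 0.3, 0.1])    # estimated mean rewards
N = np.array([50,  40,  30,  20])     # times each arm was pulled so far
t, c, s = 100, 1.0, 0.1               # step, exploration constant, reward scale

bonus = np.sqrt(np.log(t) / N)
print(np.argmax(Q + c * bonus))           # -> 0: exploits the best arm
print(np.argmax(Q * s + c * bonus))       # -> 3: rewards scaled only, bonus dominates
print(np.argmax(Q * s + c * s * bonus))   # -> 0: c scaled too, original choice restored
```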

In fact, another way of looking at it is that, when using UCB, you could dispense with the $c$ constant altogether and simply scale the rewards to tune the exploitation/exploration trade-off: smaller rewards favour exploration, larger rewards favour exploitation.

– James
  • I think the UCB that the OP is talking about is a more formal one whose error bounds can be determined. S&B uses an informal version of UCB. – Apr 26 '21 at 14:17
  • I double checked what I believe to be the original paper [P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem]. The UCB1 discussed in this paper and the one presented in Sutton and Barto seem to be essentially the same to me, unless I am missing something. – James Apr 26 '21 at 14:24
  • I mean there are fundamental guarantees for UCB1. In UCB1, c = 2 I think, for a very specific reason. But I don't know exactly what the OP wants, so only the OP can clarify. – Apr 26 '21 at 14:26
  • I see, yes, in the original paper they provide guarantees for c = sqrt(2). The reasoning behind why performance is different with scaled rewards remains the same though. The fact that OP referred to scaling c leads me to think they might be using the Sutton and Barto formulation. – James Apr 26 '21 at 14:29
  • @James In fact, in my task the reward is the ratio between the number of users who respond to the effect associated with triggering arm a_n and the total number of users. So there is no constant scaling factor for the reward; it varies depending on the situation. Is UCB1 designed to work only with binary rewards and thus unsuitable for continuous rewards in the range [0, 1]? I do not think so, since LinUCB used click-through rate as a reward, which is also variable and non-binary. – d56 Apr 27 '21 at 12:29
  • @d56 UCB can work with continuous or discrete rewards. If scaling the rewards doesn't seem appropriate, you should tune the c hyper-parameter of UCB: if the current behaviour is too explorative, lower the value of c; if it is too exploitative, raise it. – James Apr 27 '21 at 12:45
  • Yes, so I do not actually know the distribution of the rewards, except that they are non-binary. So I suppose, as you have noted, that tuning the c hyper-parameter is the only choice. That makes things somewhat more complicated, of course... – d56 Apr 27 '21 at 12:51