
I ran an experiment comparing three multi-armed bandit strategies: UCB, $\epsilon$-greedy, and Thompson sampling.

The average rewards I obtained ranked as follows:

  1. Thompson sampling had the highest average reward
  2. UCB was second
  3. $\epsilon$-greedy was third, but relatively close to UCB

Can someone explain why the results come out this way?

Java coder
  • Hi. Just to clarify: in $\epsilon$-greedy, did you use a constant value for $\epsilon$ (if so, what was it?) or did you decay it? – user5093249 Jun 15 '20 at 22:33
  • For $\epsilon$-greedy, $\epsilon$ is decayed at each iteration as $1/i$, where $i$ is the current iteration. – Java coder Jun 15 '20 at 22:41
  • What was the problem domain? In general, these techniques will rank differently in effectiveness depending on the problem domain. If it is a test or toy problem, please characterise it (how the reward functions were set up). Also, can you give more details of your experiment and clarify what you mean *precisely* by "highest average reward"? There is a large difference in approach between considering reward from the start of training and results after training. The number of training episodes is also a big factor. – Neil Slater Jun 16 '20 at 08:13
  • Please address Neil's comment by editing your post and adding more details about your problem domain, etc.; otherwise, I will not re-open this post. – nbro Jun 16 '20 at 10:04
  • Hi @nbro, I see you closed the post. In fact, I agree with Neil's comment, but I only saw it after finishing typing my answer. Should I delete my answer accordingly? – user5093249 Jun 16 '20 at 11:33
  • @user5093249 Hi. I didn't fully read your answer, but it looks good. I wouldn't delete it for now. Let's wait a little bit longer until the user clarifies his/her question; maybe your answer will still be valid as it is. Please, next time you find a post that is unclear or needs more details to be answered, flag it to be closed. This is important work that community members should do but rarely do here (on this site); I shouldn't need to close questions myself. Anyway, thanks for caring about the post and for making the site a better place! – nbro Jun 16 '20 at 11:36
  • This is actually a very good question (albeit lacking details). It is an open problem in online learning: why does Thompson sampling perform better than UCB in practice, even though UCB is theoretically the better of the two algorithms? – Dec 25 '20 at 11:50

1 Answer


The first thing to note is that your results are consistent with what is commonly reported in the bandit literature.

The second thing to note is that the performance of bandit algorithms is usually measured in terms of regret. This is the difference between (i) the reward accumulated by an oracle policy that has prior knowledge of the true rewards of the bandit arms and (ii) the reward accumulated by the bandit strategy you are evaluating. In other words, it measures the loss you incur due to not knowing which arm is optimal.
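
For concreteness, suppose there are $K$ arms with mean rewards $\mu_1, \dots, \mu_K$, let $\mu^* = \max_i \mu_i$, and let $a_t$ denote the arm pulled at step $t$ (this notation is just one common convention, not something taken from your experiment). The expected regret after $T$ steps is then

$$R_T = T\,\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{a_t}\right].$$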

Ideally, the regret should grow only logarithmically with the number of time steps. In theory, all three algorithms achieve logarithmic regret; see [1] for the regret of decaying $\epsilon$-greedy and UCB, and [2] for the regret of Thompson sampling.
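
In the notation above, "logarithmic regret" means a bound of roughly the following form, where $\Delta_i = \mu^* - \mu_i$ is the gap of arm $i$ (the exact constants differ between the three algorithms and are given in the cited papers):

$$R_T = O\left(\sum_{i:\,\Delta_i > 0} \frac{\ln T}{\Delta_i}\right).$$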

Despite their similar regret bounds, these algorithms may behave differently in practice. For example, according to experiments in [3]:

It appears that Thompson sampling is more robust than UCB when the delay is long. Thompson sampling alleviates the influence of delayed feedback$^*$ by randomizing over actions; on the other hand, UCB is deterministic and suffers a larger regret in case of a sub-optimal choice.

As for decaying $\epsilon$-greedy, the fact that it obtained the lowest rewards may be because the action selected for exploration is chosen uniformly at random (even when its reward estimate is poor). Moreover, it keeps exploring actions that have already been selected many times, unlike UCB, which uses the number of times each action has been selected to shrink its confidence bound.
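
To make the difference between the selection rules concrete, here is a minimal simulation sketch on a toy Bernoulli bandit (not your setup, which we don't know): it implements UCB1, $\epsilon$-greedy with $\epsilon_t = 1/t$ as you described in the comments, and Beta-Bernoulli Thompson sampling. The arm means, horizon, and seed are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.7]       # arbitrary Bernoulli arm means (illustration only)
K, T = len(true_means), 10_000

def run(select):
    counts = np.zeros(K)           # number of pulls per arm
    sums = np.zeros(K)             # total reward per arm
    total = 0.0
    for t in range(1, T + 1):
        a = select(t, counts, sums)
        r = float(rng.random() < true_means[a])   # Bernoulli reward
        counts[a] += 1
        sums[a] += r
        total += r
    return total / T

def ucb1(t, counts, sums):
    # Pull each arm once, then maximize empirical mean plus exploration bonus.
    if np.any(counts == 0):
        return int(np.argmin(counts))
    return int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))

def eps_greedy(t, counts, sums):
    # Decaying epsilon = 1/t, as described in the question's comments.
    if rng.random() < 1.0 / t or np.any(counts == 0):
        return int(rng.integers(K))               # explore uniformly at random
    return int(np.argmax(sums / counts))          # otherwise exploit

def thompson(t, counts, sums):
    # Beta(1 + successes, 1 + failures) posterior per arm; pick the best sample.
    samples = rng.beta(1 + sums, 1 + counts - sums)
    return int(np.argmax(samples))

for name, policy in [("UCB1", ucb1), ("eps-greedy", eps_greedy), ("Thompson", thompson)]:
    print(f"{name:10s} average reward: {run(policy):.3f}")
```

With enough time steps on a problem like this, the ranking often comes out as Thompson sampling $\ge$ UCB $\ge$ decaying $\epsilon$-greedy, but the exact numbers (and occasionally the ordering) depend on the seed, the gaps between the arm means, and the horizon.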

That said, besides these aspects, other factors could also influence the results obtained by a given bandit algorithm, including:

  • The reward distributions you chose to model your actions, and their characteristics.
  • The number of actions you have, and how far each action’s reward is from the optimal action’s reward.

$^*$ Delayed feedback occurs when the outcome of an action selection is not revealed immediately but after some time delay.

user5093249