The first thing to note is that your results seem consistent with what is commonly reported in the bandit literature.
The second thing to note is that the performance of bandit algorithms is usually measured in terms of regret. This is the difference between (i) the reward accumulated by an oracle policy with prior knowledge of the true rewards of the bandit arms and (ii) the reward accumulated by the bandit strategy you are evaluating. In other words, it measures the loss you incur by not knowing which arm is optimal.
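To make this concrete, for a $K$-armed bandit with mean rewards $\mu_1,\dots,\mu_K$, $\mu^* = \max_i \mu_i$, and $a_t$ the arm pulled at round $t$, the expected cumulative regret after $T$ rounds is commonly written as

$$
R(T) \;=\; T\,\mu^* \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right].
$$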
Ideally, the regret should grow only logarithmically with the number of rounds $T$. In theory, all three algorithms achieve logarithmic regret.
See [1] for the regret of decaying $\epsilon$-greedy and UCB and [2] for the regret of Thompson sampling.
Despite their similar regret bounds, these algorithms may behave differently in practice.
For example, according to experiments in [3]:
> It appears that Thompson sampling is more robust than UCB when the delay is long. Thompson sampling alleviates the influence of delayed feedback$^*$ by randomizing over actions; on the other hand, UCB is deterministic and suffers a larger regret in case of a sub-optimal choice.
For decaying $\epsilon$-greedy, the lowest rewards it obtained may be explained by the fact that exploration steps pick an action uniformly at random, even when that action already has a poor reward estimate. In addition, it keeps exploring actions that have already been selected many times, unlike UCB, which uses the number of times each action has been selected to tighten its confidence bound.
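To see these selection rules side by side, here is a minimal, self-contained simulation sketch (Python/NumPy). It is not meant to reproduce your setup: the arm means in `true_means`, the horizon `T`, and the `5.0 / t` decay schedule are arbitrary choices for illustration, and the variants shown are the textbook UCB1 and Beta-Bernoulli Thompson sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli bandit: these arm means are made up for illustration.
true_means = np.array([0.3, 0.5, 0.7])
n_arms = len(true_means)
T = 10_000

def run(select_action):
    """Run one bandit simulation given an action-selection rule; return regret."""
    counts = np.zeros(n_arms)       # times each arm was pulled
    successes = np.zeros(n_arms)    # total reward per arm (Bernoulli rewards)
    total_reward = 0.0
    for t in range(1, T + 1):
        a = select_action(t, counts, successes)
        r = rng.binomial(1, true_means[a])
        counts[a] += 1
        successes[a] += r
        total_reward += r
    # Regret relative to always playing the best arm.
    return T * true_means.max() - total_reward

def eps_greedy(t, counts, successes):
    # Decaying epsilon-greedy: explore uniformly with probability ~1/t.
    eps = min(1.0, 5.0 / t)            # decay schedule is an arbitrary choice
    if rng.random() < eps:
        return int(rng.integers(n_arms))  # uniform exploration, ignores estimates
    est = successes / np.maximum(counts, 1)
    return int(np.argmax(est))

def ucb1(t, counts, successes):
    # UCB1: pull each arm once, then maximize mean + confidence bonus.
    if (counts == 0).any():
        return int(np.argmin(counts))
    est = successes / counts
    bonus = np.sqrt(2 * np.log(t) / counts)
    return int(np.argmax(est + bonus))

def thompson(t, counts, successes):
    # Thompson sampling with Beta(1, 1) priors: sample a mean for each arm,
    # then act greedily on the samples (randomized, unlike UCB).
    samples = rng.beta(1 + successes, 1 + counts - successes)
    return int(np.argmax(samples))

for name, policy in [("eps-greedy", eps_greedy), ("UCB1", ucb1), ("Thompson", thompson)]:
    print(name, run(policy))
```

The comments highlight the structural difference discussed above: epsilon-greedy explores without looking at the estimates or pull counts, UCB1 shrinks its bonus as an arm is pulled more often, and Thompson sampling explores through posterior randomization.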
That said, there may be other factors that influence the results obtained by a given bandit algorithm. These include:

- The reward distributions you chose to model your actions, and their characteristics.
- The number of actions you have, and how far each action’s reward is from the optimal action’s reward (see the note after this list).
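On the second point, using the standard gap notation $\Delta_i = \mu^* - \mu_i$ (not something specific to your experiment), instance-dependent bounds for UCB-type algorithms scale roughly as

$$
\sum_{i\,:\,\Delta_i > 0} \frac{\log T}{\Delta_i},
$$

so arms whose means sit just below the optimal arm's mean are the hardest to rule out and contribute the most regret.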
$^*$ Delayed feedback occurs when the outcome of an action selection is not revealed immediately but after some time delay.