For questions about upper confidence bound (UCB) based algorithms or action selection strategies, e.g. in the context of bandit or reinforcement learning problems.
Questions tagged [upper-confidence-bound]
16 questions
5
votes
1 answer
In MCTS, what to do if I do not want to simulate till the end of the game?
I'm trying to implement MCTS with UCT for a board game and I'm kinda stuck. The state space is quite large (3e15), and I'd like to compute a good move in less than 2 seconds. I already have MCTS implemented in Java from here, and I noticed that it…

Sami
- 53
- 4
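A common workaround when full playouts are too slow is to cut each simulation off after a fixed depth and back up a heuristic evaluation instead of a terminal result. A minimal sketch of such a depth-limited rollout, assuming a hypothetical state API with is_terminal, get_legal_moves, apply_move, result and heuristic_value (these names are illustrative, not the asker's Java code):

    import random

    def depth_limited_rollout(state, max_depth=20):
        # Random playout that stops after max_depth plies.
        depth = 0
        while not state.is_terminal() and depth < max_depth:
            move = random.choice(state.get_legal_moves())
            state = state.apply_move(move)
            depth += 1
        if state.is_terminal():
            return state.result()          # 1 = win, 0 = loss, 0.5 = draw
        return state.heuristic_value()     # cheap static evaluation in [0, 1]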
5
votes
1 answer
What should the initial UCT value be with MCTS, when a leaf's simulation count is zero? Infinity?
I am implementing a Monte Carlo Tree Search algorithm, where the selection process is done through the Upper Confidence Bound formula:
def uct(state):
    log_n = math.log(state.parent.sim_count)
    explore_term = self.exploration_weight *…

semyd
- 153
- 1
- 5
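The usual convention is to treat an unvisited child as having an infinite UCT value, so every child is simulated at least once before any sibling is revisited. A minimal sketch of a selection step written this way (the node attributes sim_count, win_count, parent and children are assumptions, not the asker's exact classes):

    import math

    def uct_value(node, exploration_weight=math.sqrt(2)):
        # Unvisited children get +infinity, so they are always chosen
        # ahead of any child that has already been simulated.
        if node.sim_count == 0:
            return float('inf')
        exploit = node.win_count / node.sim_count
        explore = exploration_weight * math.sqrt(
            math.log(node.parent.sim_count) / node.sim_count)
        return exploit + explore

    def select_child(parent):
        # Ties between several unvisited children are broken arbitrarily by max().
        return max(parent.children, key=uct_value)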
5
votes
1 answer
Multi Armed Bandits with large number of arms
I'm dealing with a (stochastic) Multi-Armed Bandit (MAB) with a large number of arms.
Consider a pizza machine that produces a pizza depending on an input $i$ (equivalent to an arm). The (finite) set of arms $K$ is given by $K=X_1\times X_2 \times…

D. B.
- 101
- 6
4
votes
1 answer
Why do we have two similar action selection strategies for UCB1?
In the literature, there are at least two similar versions of UCB1's action selection strategy/policy. For example, in the paper Algorithms for the multi-armed bandit problem (2000/2014), at time step $t$, an action is…

nbro
- 39,006
- 12
- 98
- 176
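For context, the two forms usually encountered differ only in whether the exploration constant is fixed at $\sqrt{2}$ (as in Auer et al., 2002) or left as a tunable $c$ (as in Sutton & Barto, 2018); they coincide when $c=\sqrt{2}$. A sketch of both, assuming an empirical mean q, arm pull count n_a and time step t:

    import math

    def ucb1_fixed(q, n_a, t):
        # UCB1 with the exploration factor fixed at sqrt(2), as in Auer et al. (2002).
        return q + math.sqrt(2 * math.log(t) / n_a)

    def ucb1_tunable(q, n_a, t, c=math.sqrt(2)):
        # Parameterised form Q_t(a) + c * sqrt(ln t / N_t(a)); identical to
        # the fixed form when c = sqrt(2).
        return q + c * math.sqrt(math.log(t) / n_a)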
4
votes
2 answers
Why do we use $X_{I_t,t}$ and $v_{I_t}$ to denote the reward received at time step $t$ and the distribution of the chosen arm $I_t$?
I'm doing some introductory research on classical (stochastic) MABs. However, I'm a little confused about the common notation (e.g. in the popular paper of Auer (2002) or Bubeck and Cesa-Bianchi (2012)).
As in the latter study, let us consider an…

MAB_N00B
- 41
- 3
4
votes
2 answers
Should I use exploration strategy in Policy Gradient algorithms?
In policy gradient algorithms, the output is a stochastic policy: a probability for each action.
I believe that if I follow the policy (sample an action from the policy) I make use of exploration because each action has a certain probability so I…

gnikol
- 175
- 7
3
votes
1 answer
How does UCT in the MCTS selection phase avoid starvation?
The first step of MCTS is to keep choosing nodes based on Upper Confidence Bounds applied to Trees (UCT) until it reaches a leaf node, where UCT is defined as
$$\frac{w_i}{n_i}+c\sqrt{\frac{\ln(t)}{n_i}},$$
where
$w_i$ = number of wins after the $i$-th…

user8714896
- 717
- 1
- 4
- 21
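The usual intuition is that the exploration term $c\sqrt{\ln(t)/n_i}$ grows without bound in $t$ while $n_i$ stays fixed for an ignored child, so its UCT score eventually overtakes any heavily exploited sibling. A small numeric illustration with made-up win rates:

    import math

    def uct(w, n, t, c=math.sqrt(2)):
        return w / n + c * math.sqrt(math.log(t) / n)

    # A child visited once (and lost) versus a sibling with a 60% win rate
    # that receives essentially all subsequent visits.
    for t in (10, 100, 10_000, 1_000_000):
        ignored = uct(w=0, n=1, t=t)
        favourite = uct(w=0.6 * t, n=t, t=t)
        print(t, round(ignored, 3), round(favourite, 3))
    # The ignored child's score keeps growing with t, so it must be
    # selected again eventually: no node starves forever.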
3
votes
3 answers
Why aren't exploration techniques, such as UCB or Thompson sampling, used in full RL problems?
Why aren't exploration techniques that are typically used in bandit problems, such as UCB or Thompson sampling, used in full RL problems?
Monte Carlo Tree Search may use the above-mentioned methods in its selection step, but why do value-based and policy…

Mika
- 331
- 1
- 8
2
votes
1 answer
Difference in UCB performance when scaling the rewards
I notice the following behavior when running experiments with $\epsilon$-greedy and UCB1. If the reward is kept binary (0 or 1), both algorithms' performances are on par with each other. However, if I make the reward continuous (and bounded in [0, 1])…

d56
- 223
- 1
- 7
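A quick way to investigate this kind of effect is to run the same UCB1 implementation against two reward distributions with identical means but different variances: the exploration bonus does not adapt to the reward scale or variance, only the empirical means do. A minimal harness (the arm means and distributions below are made up for illustration):

    import math
    import random

    def run_ucb1(pull, n_arms, horizon, c=math.sqrt(2)):
        # Plain UCB1; pull(a) must return a reward in [0, 1] for arm a.
        counts = [0] * n_arms
        values = [0.0] * n_arms
        total = 0.0
        for t in range(1, horizon + 1):
            if t <= n_arms:          # play each arm once before using the bound
                a = t - 1
            else:
                a = max(range(n_arms),
                        key=lambda i: values[i] + c * math.sqrt(math.log(t) / counts[i]))
            r = pull(a)
            counts[a] += 1
            values[a] += (r - values[a]) / counts[a]   # incremental mean
            total += r
        return total / horizon

    means = [0.3, 0.5, 0.7]
    bernoulli = lambda a: float(random.random() < means[a])
    # Continuous rewards in [0, 1] with the same means but much lower variance.
    continuous = lambda a: min(1.0, max(0.0, random.gauss(means[a], 0.1)))

    print(run_ucb1(bernoulli, 3, 10_000))
    print(run_ucb1(continuous, 3, 10_000))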
2
votes
1 answer
In UCB, is the actual upper bound an upper bound of a one-sided or a two-sided confidence interval?
I'm a bit confused about the visualization of the upper bound (following the notation of Sutton & Barto (2018))
$$Q_t(a)+C\sqrt{\frac{\ln(t)}{N_t(a)}}$$
In many blog posts about the UCB(1)-algorithm, such as visualized in the following…

D. B.
- 101
- 6
2
votes
0 answers
Why is the ideal exploration parameter in the UCT algorithm $\sqrt{2}$?
From Wikipedia, in the Monte-Carlo Tree Search algorithm, you should choose the node that maximizes the value:
$$\frac{w_i}{n_i}+c\sqrt{\frac{\ln N_i}{n_i}},$$
where
$w_i$ stands for the number of wins for the…

Gilad Felsen
- 21
- 1
1
vote
0 answers
UCB-like algorithms: how do you compute the exploration bonus?
My question concerns Stochastic Combinatorial Multiarmed Bandits. More specifically, the algorithm called CombUCB1 presented in this paper. It is a UCB-like algorithm.
Essentially, in each round of the sequential game the learner chooses a super-arm…

Adam
- 11
- 2
1
vote
1 answer
Why am I getting better performance with Thompson sampling than with UCB or $\epsilon$-greedy in a multi-armed bandit problem?
I ran a test using three strategies for a multi-armed bandit: UCB, $\epsilon$-greedy, and Thompson sampling.
The results for the rewards I got are as follows:
Thompson sampling had the highest average reward
UCB was second
$\epsilon$-greedy was third,…

Java coder
- 11
- 1
- 2
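For a Bernoulli bandit such a ranking is plausible. A sketch of a Thompson sampler with Beta(1, 1) priors that could be dropped into the same kind of test harness as the UCB and $\epsilon$-greedy baselines (the arm means below are made up):

    import random

    def thompson_bernoulli(true_means, horizon):
        # Thompson sampling with Beta(1, 1) priors on Bernoulli arms.
        n = len(true_means)
        alpha = [1] * n   # prior successes + 1
        beta = [1] * n    # prior failures + 1
        total = 0
        for _ in range(horizon):
            samples = [random.betavariate(alpha[i], beta[i]) for i in range(n)]
            a = samples.index(max(samples))       # play the arm with the best sample
            reward = int(random.random() < true_means[a])
            alpha[a] += reward
            beta[a] += 1 - reward
            total += reward
        return total / horizon

    print(thompson_bernoulli([0.3, 0.5, 0.7], 10_000))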
1
vote
1 answer
How do we arrive at the formula for UCB action selection in the multi-armed bandit problem?
I came across the formula for Upper Confidence Bound Action Selection (while studying multi-armed bandit problem), which looks like:
$$
A_t \doteq \operatorname{argmax}_a \left[ Q_t(a) + c \sqrt{ \frac{\ln t}{N_t(a)} } \right]
$$
Although, I…

SAGALPREET SINGH
- 147
- 1
- 6
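The bound is usually motivated with Hoeffding's inequality. A brief sketch of the standard argument (not necessarily the exact derivation used in any particular textbook): for rewards in $[0, 1]$, Hoeffding gives
$$P\left(q_*(a) \ge Q_t(a) + u\right) \le e^{-2 N_t(a) u^2}.$$
Choosing the failure probability to shrink as $t^{-4}$ and solving $e^{-2 N_t(a) u^2} = t^{-4}$ for $u$ gives $u = \sqrt{\frac{2 \ln t}{N_t(a)}}$, i.e. the exploration bonus above with $c = \sqrt{2}$; a larger $c$ simply corresponds to a more conservative, faster-shrinking failure probability.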
0
votes
0 answers
How to use UCB or TS in linear programming?
Consider a sequential decision-making problem over $T$ periods in which the parameters of the problem must be learned while also optimizing an objective function. One possibility is to model the problem as a dynamic program and use RL techniques to solve…

Amin
- 471
- 2
- 11