Questions tagged [monte-carlo-methods]
For questions related to Monte Carlo methods in reinforcement learning and other AI sub-fields. ("Monte Carlo" refers to random sampling of the search space.)
78 questions
20
votes
2 answers
What is the difference between First-Visit Monte-Carlo and Every-Visit Monte-Carlo Policy Evaluation?
I came across these two algorithms, but I cannot understand the difference between them, both in terms of implementation and intuition.
So, what difference does the second point in both slides refer to?
user9947
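A minimal tabular sketch of the distinction this question asks about, assuming episodes are given as lists of (state, reward) pairs and a discount factor gamma (these conventions are illustrative, not from the post); the only difference between the two variants is whether a state's return is recorded at its first occurrence in an episode or at every occurrence.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    """Tabular Monte Carlo policy evaluation.

    episodes: list of episodes, each a list of (state, reward) pairs,
              where reward is the reward received after leaving that state.
    Returns a dict mapping state -> estimated value.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for episode in episodes:
        # Compute the return G_t for every time step, working backwards.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G

        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue  # first-visit: only the first occurrence counts
            seen.add(state)
            returns_sum[state] += returns[t]
            returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

Both variants converge to $v_\pi$; with first_visit=False, a state visited several times in one episode contributes several (correlated) return samples instead of one.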
17
votes
1 answer
How does "Monte-Carlo search" work?
I have heard about this concept in a Reddit post about AlphaGo. I have tried to go through the paper and the article, but could not really make sense of the algorithm.
So, can someone give an easy-to-understand explanation of how the Monte-Carlo…

Dawny33
- 1,371
- 13
- 29
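AlphaGo uses full Monte-Carlo tree search; the sketch below shows only the simplest "flat" Monte-Carlo search (random playouts per candidate move, no tree) on an invented Nim-style toy game, to convey the core idea of evaluating moves by sampling.

```python
import random

# Invented Nim-style toy game: a pile of stones, each player removes 1-3,
# and whoever takes the last stone wins.
def legal_moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]

def random_playout(stones, my_turn):
    """Play uniformly random moves to the end; return 1 if 'we' win, else 0."""
    while stones > 0:
        stones -= random.choice(legal_moves(stones))
        if stones == 0:
            return 1 if my_turn else 0
        my_turn = not my_turn
    return 0

def flat_monte_carlo_move(stones, playouts_per_move=5000):
    """Score each legal move by the average result of random playouts from it."""
    best_move, best_score = None, -1.0
    for move in legal_moves(stones):
        if move == stones:
            return move  # taking the last stone wins immediately
        wins = sum(random_playout(stones - move, my_turn=False)
                   for _ in range(playouts_per_move))
        score = wins / playouts_per_move
        if score > best_score:
            best_move, best_score = move, score
    return best_move

print(flat_monte_carlo_move(10))  # usually 2, leaving the opponent a multiple of 4
```

Full MCTS adds a tree over the first few moves and a selection rule such as UCT, so playouts concentrate on promising lines instead of being spread uniformly over the candidate moves.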
9
votes
2 answers
What is the intuition behind TD($\lambda$)?
I'd like to better understand temporal-difference learning. In particular, I'm wondering if it is prudent to think about TD($\lambda$) as a type of "truncated" Monte Carlo learning?

Nick Kunz
- 145
- 1
- 5
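One standard way to make the "truncated Monte Carlo" intuition precise is the forward-view $\lambda$-return (standard notation, as in Sutton & Barto), which averages the n-step returns with geometrically decaying weights:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}), \qquad G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

At $\lambda = 0$ this reduces to the one-step TD target, and as $\lambda \to 1$ it approaches the full Monte Carlo return, so TD($\lambda$) interpolates between the two rather than being a literal truncation.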
8
votes
1 answer
How to fill in missing transitions when sampling an MDP transition table?
I have a simulator modelling a relatively complex scenario. I extract ~12 discrete features from the simulator state, which form the basis for my MDP state space.
Suppose I am estimating the transition table for an MDP by running a large number of…

Brendan Hill
- 263
- 1
- 6
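A minimal sketch of one common way to handle unseen transitions when estimating such a table: count-based estimation with additive (Laplace) smoothing, so transitions never observed in the data get a small nonzero probability instead of zero. The alpha parameter and dict layout below are illustrative choices, not from the post.

```python
from collections import defaultdict

def estimate_transition_table(transitions, states, alpha=0.1):
    """Estimate P(s' | s, a) from observed (s, a, s') triples.

    transitions: iterable of (state, action, next_state) samples.
    states: the full list of discrete states, so next-states never observed
            for a given (s, a) still receive a small smoothed probability.
    alpha: additive smoothing pseudo-count (alpha=0 gives plain frequencies).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1.0

    table = {}
    n_states = len(states)
    for (s, a), next_counts in counts.items():
        total = sum(next_counts.values()) + alpha * n_states
        table[(s, a)] = {s2: (next_counts.get(s2, 0.0) + alpha) / total
                         for s2 in states}
    # Note: (state, action) pairs never sampled at all are absent from `table`;
    # a uniform distribution is one possible fallback for those.
    return table
```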
8
votes
1 answer
MCTS: How to choose the final action from the root
When the time allotted to Monte Carlo tree search runs out, what action should be chosen from the root?
The original UCT paper (2006) says bestAction in its algorithm.
Monte-Carlo Tree Search: A New Framework for Game AI (2008) says
The game…

user76284
- 347
- 1
- 14
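The two rules most often cited in the MCTS literature are "robust child" (pick the most-visited action) and "max child" (pick the action with the highest mean value). A short sketch, assuming a hypothetical node interface with `visits` and `total_value` fields (not from either paper):

```python
def choose_final_action(root, rule="robust"):
    """Pick the move to actually play once search time runs out.

    root.children is assumed to map actions to child nodes carrying
    `visits` and `total_value` statistics (hypothetical interface).
    """
    def mean_value(node):
        return node.total_value / node.visits if node.visits else float("-inf")

    if rule == "robust":
        # "Robust child": most-visited action; visit counts are the least
        # noisy statistic, so this is the most common choice.
        return max(root.children, key=lambda a: root.children[a].visits)
    if rule == "max":
        # "Max child": action with the highest mean value.
        return max(root.children, key=lambda a: mean_value(root.children[a]))
    raise ValueError(rule)
```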
5
votes
1 answer
In MCTS, what to do if I do not want to simulate till the end of the game?
I'm trying to implement MCTS with UCT for a board game and I'm kinda stuck. The state space is quite large (3e15), and I'd like to compute a good move in less than 2 seconds. I already have MCTS implemented in Java from here, and I noticed that it…

Sami
- 53
- 4
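One common answer is to cut rollouts off at a fixed depth and back up a heuristic evaluation of the reached position instead of the game result. A sketch with a hypothetical caller-supplied interface (the function names are placeholders, not from the post):

```python
import random

def truncated_rollout(state, legal_moves, apply_move, evaluate,
                      is_terminal, max_depth=20):
    """Random playout cut off after max_depth moves.

    If a terminal state is reached, its exact outcome is returned via
    evaluate; otherwise a heuristic evaluation of the reached non-terminal
    state is returned. legal_moves, apply_move, evaluate and is_terminal
    are caller-supplied functions (hypothetical interface for this sketch).
    """
    for _ in range(max_depth):
        if is_terminal(state):
            return evaluate(state)   # exact outcome at a terminal state
        state = apply_move(state, random.choice(legal_moves(state)))
    return evaluate(state)           # heuristic value of a non-terminal state
```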
5
votes
1 answer
Why do we need importance sampling?
I was studying the off-policy policy improvement method. Then I encountered importance sampling. I completely understood the mathematics behind the calculation, but I am wondering what a practical example of importance sampling would be.
For instance,…

Alireza Hosseini
- 51
- 2
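A minimal numeric sketch (invented distributions, not from the post) of why the ratio is needed: we only have samples from a behaviour distribution b, yet want an expectation under a target distribution p.

```python
import random

# Target distribution p and behaviour distribution b over three outcomes.
p = {0: 0.7, 1: 0.2, 2: 0.1}   # what we care about, but cannot sample here
b = {0: 0.2, 1: 0.3, 2: 0.5}   # what we can actually sample from
f = {0: 1.0, 1: 5.0, 2: 10.0}  # function whose expectation under p we want

samples = random.choices(list(b), weights=list(b.values()), k=100_000)

naive = sum(f[x] for x in samples) / len(samples)                   # estimates E_b[f]: wrong target
weighted = sum(p[x] / b[x] * f[x] for x in samples) / len(samples)  # estimates E_p[f]

true_value = sum(p[x] * f[x] for x in f)
print(naive, weighted, true_value)  # weighted is close to true_value = 2.7
```

Reweighting each sample by p(x)/b(x) corrects for the fact that b over-samples some outcomes and under-samples others relative to p.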
5
votes
1 answer
Why does TD Learning require Markovian domains?
One of my friends and I were discussing the differences between Dynamic Programming, Monte Carlo, and Temporal Difference (TD) learning as policy evaluation methods, and we agreed that Dynamic Programming requires the Markov assumption…

stoic-santiago
- 1,121
- 5
- 18
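One way to see where the Markov assumption enters: the TD target bootstraps via the Bellman expectation identity

$$v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\right]$$

which treats $S_{t+1}$ as a sufficient summary of everything that matters for the rest of the trajectory. The Monte Carlo target $G_t$ never conditions on an intermediate state, so it does not rely on that property.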
5
votes
2 answers
How can we compute the ratio between the distributions if we don't know one of the distributions?
Here is my understanding of importance sampling. If we have two distributions $p(x)$ and $q(x)$, where we have a way of sampling from $p(x)$ but not from $q(x)$, but we want to compute the expectation wrt $q(x)$, then we use importance sampling.…

pecey
- 313
- 2
- 9
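In the off-policy RL setting that usually motivates this question, the two distributions are over trajectories, and the unknown environment dynamics cancel out of the ratio, leaving only the (known) policies:

$$\frac{\Pr[\tau \mid \pi]}{\Pr[\tau \mid b]} = \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

So only the ratio of action probabilities is required, not the transition model.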
4
votes
1 answer
Why is GLIE Monte-Carlo control an on-policy control?
In slide 16 of his lecture 5 of the course "Reinforcement Learning", David Silver introduced GLIE Monte-Carlo Control.
But why is it an on-policy control? The sampling follows a policy $\pi$ while improvement follows an $\epsilon$-greedy policy, so…

fish_tree
- 247
- 1
- 6
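A minimal sketch in the spirit of the slide the question cites: the same $\epsilon$-greedy policy (with $\epsilon_k = 1/k$) both generates the episodes and is the policy being improved, which is what makes the method on-policy. The tiny chain MDP and its interface below are invented for illustration, not from the slides.

```python
import random
from collections import defaultdict

def step(state, action):
    """Toy deterministic chain: action 1 moves right and earns 1 on
    reaching state 2 (terminal); action 0 resets to state 0 with reward 0."""
    if action == 1:
        nxt = state + 1
        return (nxt, 1.0, True) if nxt == 2 else (nxt, 0.0, False)
    return 0, 0.0, False

def epsilon_greedy(Q, state, epsilon, n_actions=2):
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def glie_mc_control(n_episodes=5000, gamma=0.9):
    Q, N = defaultdict(float), defaultdict(int)
    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k                              # GLIE: exploration decays to zero
        state, episode, done, t = 0, [], False, 0
        while not done and t < 20:
            action = epsilon_greedy(Q, state, epsilon)  # behaviour policy == policy being improved
            next_state, reward, done = step(state, action)
            episode.append((state, action, reward))
            state, t = next_state, t + 1
        G = 0.0
        for state, action, reward in reversed(episode):  # every-visit MC update of Q
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q

Q = glie_mc_control()
print(max(range(2), key=lambda a: Q[(0, a)]))  # greedy action in state 0; typically 1 (toward the reward)
```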
4
votes
2 answers
Why is the target called "target" in Monte Carlo and TD learning if it is not the true target?
I was going through Sutton's book; for sample-based learning that estimates the expectations, we have this formula
$$
\text{new estimate} = \text{old estimate} + \alpha(\text{target} - \text{old estimate})
$$
What I don't quite understand is…

Chukwudi Ogbonna
- 125
- 4
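A minimal sketch contrasting the two "targets" plugged into the same incremental-mean update; both are noisy or biased stand-ins for $v_\pi(s)$ rather than the true value, which is what the question is getting at. The helper names below are illustrative.

```python
def incremental_update(old_estimate, target, alpha):
    """new_estimate = old_estimate + alpha * (target - old_estimate)"""
    return old_estimate + alpha * (target - old_estimate)

def mc_target(rewards, gamma):
    """Monte Carlo target: the full sampled return from this visit onward."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

def td_target(reward, gamma, value_of_next_state):
    """TD(0) target: one sampled reward plus the current estimate of the next state."""
    return reward + gamma * value_of_next_state

V_s = 0.0
V_s = incremental_update(V_s, mc_target([0.0, 1.0, 2.0], gamma=0.9), alpha=0.1)
V_s = incremental_update(V_s, td_target(0.0, 0.9, value_of_next_state=2.5), alpha=0.1)
print(V_s)
```

Each update drags the estimate a fraction $\alpha$ of the way toward whatever target was sampled, so the estimate converges to the mean of the targets, not to any single one of them.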
4
votes
1 answer
Why are state-values alone not sufficient in determining a policy (without a model)?
"If a model is not available, then it is particularly useful to estimate action values (the
values of state-action pairs) rather than state values. With a model, state values alone are
sufficient to determine a policy; one simply looks ahead one…

stoic-santiago
- 1,121
- 5
- 18
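A short sketch of the contrast in that passage: acting greedily on action values needs no model, whereas acting greedily on state values requires the transition probabilities and rewards for the one-step lookahead (the P and R arguments below stand in for the model; their layout is illustrative).

```python
def greedy_from_q(Q, state, actions):
    """Model-free: pick the action with the largest estimated action value."""
    return max(actions, key=lambda a: Q[(state, a)])

def greedy_from_v(V, state, actions, P, R, gamma=0.99):
    """Model-based one-step lookahead: needs P[s][a] = {s': prob} and R[s][a]."""
    def lookahead(a):
        return R[state][a] + gamma * sum(p * V[s2] for s2, p in P[state][a].items())
    return max(actions, key=lookahead)
```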
4
votes
1 answer
What does the term $|\mathcal{A}(s)|$ mean in the $\epsilon$-greedy policy?
I've been looking online for a while for a source that explains these computations, but I can't find anywhere what $|\mathcal{A}(s)|$ means. I guess $\mathcal{A}$ is the action set, but I'm not sure about that notation:
$$\frac{\varepsilon}{|\mathcal{A}(s)|}…

Metrician
- 95
- 5
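For reference, $\mathcal{A}(s)$ is the set of actions available in state $s$ and $|\mathcal{A}(s)|$ its size; the full $\epsilon$-greedy policy (as in Sutton & Barto) assigns

$$\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}(s)|} & \text{if } a = \arg\max_{a'} Q(s, a') \\[4pt] \dfrac{\varepsilon}{|\mathcal{A}(s)|} & \text{otherwise} \end{cases}$$

so the probabilities sum to 1: the greedy action gets the leftover $1-\varepsilon$ plus an equal share of the exploration mass.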
4
votes
1 answer
How does policy evaluation work for continuous state space model-free approaches?
How does policy evaluation work for continuous state space model-free approaches?
Theoretically, a model-based approach for discrete state and action spaces can be computed via dynamic programming by solving the Bellman equation.
Let's say you…

calveeen
- 1,251
- 7
- 17
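A minimal sketch of one common model-free answer for continuous states: represent $v(s) \approx \mathbf{w}^\top \phi(s)$ and update the weights with semi-gradient TD(0). The feature map below is a made-up illustration, not from the post.

```python
import numpy as np

def phi(state):
    """Hypothetical feature map for a 1-D continuous state (illustrative only)."""
    return np.array([1.0, state, state ** 2])

def semi_gradient_td0(transitions, alpha=0.01, gamma=0.99):
    """Policy evaluation with linear function approximation.

    transitions: iterable of (state, reward, next_state, done) samples
    generated by following the policy being evaluated.
    """
    w = np.zeros(3)
    for s, r, s_next, done in transitions:
        v_s = w @ phi(s)
        v_next = 0.0 if done else w @ phi(s_next)
        td_error = r + gamma * v_next - v_s
        w += alpha * td_error * phi(s)  # semi-gradient: no gradient flows through v_next
    return w
```

Monte Carlo evaluation works the same way, with the sampled return $G_t$ in place of the bootstrapped target.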
4
votes
1 answer
How does Monte Carlo have high variance?
I was going through David Silver's lecture on reinforcement learning (lecture 4). At 51:22 he says that Monte Carlo (MC) methods have high variance and zero bias. I understand the zero bias part. It is because it is using the true value of value…

Bhuwan Bhatt
- 394
- 1
- 11
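One way to see the variance claim: the Monte Carlo target is the full return, a sum of many random rewards (and random actions and transitions) over the rest of the episode, whereas the TD target involves only one of each:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \qquad \text{vs.} \qquad R_{t+1} + \gamma V(S_{t+1})$$

Every term in $G_t$ contributes randomness, so its variance grows with the effective episode length; the TD target has much lower variance but is biased whenever $V$ is inaccurate.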