
In Bandit Based Monte-Carlo Planning, the paper where UCT is introduced as a planning algorithm, there is an algorithm description on page 285 (page 4 of the PDF).

Comparing this implementation of UCT (a specific type of MCTS algorithm) to the one normally used in games, there is one major difference: here, rewards are collected at every state, instead of only performing an evaluation at the end of the simulation.

My questions are (they are all related to each other):

  • Is this the only big difference? Or, in other words, can I use the same implementation as in MCTS for games, with the 4 stages (selection, expansion, simulation and backpropagation), where the result of the simulation is the accumulated reward instead of a value between 0 and 1? How would the UCT selection be adjusted in this case?

  • What does the UpdateValue function in line 12 do exactly? In the text it says it is used to adjust the state-action pair value at a given depth; will this be used for the selection? How is it calculated exactly?

  • What is the depth parameter needed for? Is it related to UpdateValue?

Finally, I would like to ask whether you know of any other papers with a clear implementation of UCT for planning that uses intermediate rewards, not only a reward at the end of the simulation.

Miguel Saraiva
  • Hi Miguel. Please, next time, ask one question per post. It is unlikely that someone else will have exactly the same set of questions in the future, so this post will be useful only to you. Furthermore, you make life easier for the answerers: some people might know the answer to one of the questions, but not to all of them, so they might not answer at all. Does this make sense to you? In this case, you have someone who is an expert in MCTS, etc., but that will not always be the case. – nbro Jul 16 '19 at 22:48
  • Sure. No problem. – Miguel Saraiva Jul 17 '19 at 09:40

1 Answer


Is this the only big difference? Or in other words, can I do the same implementation as in a simple MCTS with the 4 stages: selection, expansion, simulation and backpropagation, where the result of the simulation is the accumulated reward instead of a value between 0 and 1? How would the UCT selection be adjusted in this case?

No, this is not the only difference. The important differences between the "original UCT" (as described in the paper you linked) and the "standard UCT" (as typically implemented in game AI research) are nicely summarised in Subsection 2.4 of "On Monte Carlo Tree Search and Reinforcement Learning". I'd recommend reading the entire paper, or large parts of it, if you have the time, but to summarise the key differences:

  • The original stores as many new nodes per iteration as memory limitations allow; the standard implementation only expands one node per iteration.
  • The original uses transpositions; the standard implementation does not.
  • The original uses a discount factor $\gamma$; the standard does not (equivalent to picking $\gamma = 1$).
  • The original takes into account at which point in time rewards were observed; the standard does not.

That last point is the one your question mainly seems to be about. Your idea of using a standard implementation, but accumulating any intermediate rewards and treating them as one large terminal reward, would not be the same. Actually taking into account the times at which rewards were observed, as done in the original UCT, can lead to more accurate value estimates in nodes in the middle of the tree (or at least to finding them more quickly). Your idea would be equivalent to learning from Monte-Carlo backups in Reinforcement Learning literature (e.g. Sutton and Barto's book), whereas the original UCT implementation would be more similar to learning from TD($0$) backups.

Note that, when talking about the "standard" implementation of UCT, that's very often in papers about two-player zero-sum games in which all intermediate rewards have a value of $0$, discounting is considered to be unimportant (i.e. $\gamma = 1$), and only terminal game states have a non-zero reward. In this special case, the two ideas do become equivalent.
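
To make that distinction concrete, here is a minimal sketch (in Python, mine rather than the paper's) of the two backup styles. The `path`, `visits` and `total_return` names, and the idea of recording a reward per traversed edge, are my own assumptions about how the statistics might be stored:

```python
# Minimal sketch (not from the paper) contrasting the two backup styles.
# Assumes `path` is a list of (node, action, reward) triples recorded while
# descending from the root, and `tail_return` is the (discounted) return
# collected below the deepest node on the path.

def backup_timed(path, tail_return, gamma=0.99):
    """Original-UCT-style backup: each node is updated with the discounted
    return as seen from its own depth onwards."""
    ret = tail_return
    for node, action, reward in reversed(path):
        ret = reward + gamma * ret           # return from this node's depth
        node.visits[action] += 1
        node.total_return[action] += ret     # feeds the average used in selection

def backup_terminal_only(path, accumulated_reward):
    """'Standard' game-style backup: one single value (e.g. the accumulated
    reward, or a win/loss result) propagated unchanged to every node."""
    for node, action, _ in reversed(path):
        node.visits[action] += 1
        node.total_return[action] += accumulated_reward
```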


What does the UpdateValue function in line 12 do exactly? In the text it says it is used to adjust the state-action pair value at a given depth; will this be used for the selection? How is it calculated exactly?

This would be done in the "standard" way. The visit count is incremented, and $q$ could be added to a sum of all $q$ values ever observed during the search for that state-action pair, such that the average score can be computed as the sum of $q$ scores divided by visit count. This average score is the $\bar{X}$ used in Selection.
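
For illustration, a possible (hypothetical) implementation of that update and of the corresponding UCB1-based selection could look like this; the `visits` and `total_return` attributes are assumed names, not taken from the paper:

```python
import math

def update_value(node, action, q):
    """Hypothetical UpdateValue: bump the visit count and accumulate q,
    so the average score can be recovered as total_return / visits."""
    node.visits[action] += 1
    node.total_return[action] += q

def uct_select(node, c=math.sqrt(2)):
    """UCT selection: average score X-bar plus a UCB1 exploration bonus."""
    n_total = sum(node.visits.values())
    def score(action):
        n = node.visits[action]
        if n == 0:
            return float("inf")                 # prefer untried actions
        x_bar = node.total_return[action] / n   # the X-bar used in selection
        return x_bar + c * math.sqrt(math.log(n_total) / n)
    return max(node.visits, key=score)
```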


What is the depth parameter needed for? Is it related to UpdateValue?

I think it's usually not really needed. In the paper, they describe that one may consider having a depth cut-off. From the paper:

"This can be the reach of a terminal state, or episodes can be cut at a certain depth (line 8)."

Dennis Soemers
  • If that corresponds to the average score, how can we guarantee it will be bounded between 0 and 1? Even if each individual reward is within those bounds, we cannot say that the sum $R + \gamma R_1 + \dots$ will be within the same bounds, right? – Miguel Saraiva Jul 16 '19 at 16:05
  • @MiguelSaraiva In practice that may be difficult, yes. In theory, you can always guarantee it if you have a fully defined MDP with $\gamma < 1$ or with a theoretical limit on episode durations. Then you can always find at least some theoretical upper bound on the magnitude of such a discounted sum, and map it to lie in $[0, 1]$ by dividing all rewards by that upper bound. Those $[0, 1]$ bounds are normally only really important for the theoretical proofs anyway; in practice it's fine if we violate them (though sticking to them might make hyperparameter tuning easier). – Dennis Soemers Jul 16 '19 at 16:40
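
For concreteness, assuming non-negative rewards bounded by some $R_{\max}$ and $\gamma < 1$, the bound mentioned in the last comment works out as:

$$0 \le \sum_{t=0}^{\infty} \gamma^t R_t \le \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1-\gamma},$$

so dividing every reward (or, equivalently, the whole discounted sum) by $R_{\max}/(1-\gamma)$ keeps the resulting returns in $[0, 1]$.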