
I'm doing research on a finite-horizon Markov decision process with $t = 1, \dots, 40$ periods. In every time step $t$, the (only) agent has to choose an action $a(t) \in A(t)$ while it is in state $s(t) \in S(t)$. The chosen action $a(t)$ in state $s(t)$ affects the transition to the following state $s(t+1)$.

In my case, the following holds true: $A(t) = A$ and $S(t) = S$, while the size of $A$ is $6\,000\,000$ (6 million) and the size of $S$ is $10^8$. Furthermore, the transition function is stochastic.
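For completeness, the objective I have in mind is the standard finite-horizon one (here $r$ denotes a per-period reward function and $\pi$ a possibly time-dependent policy; neither is made explicit above, so this is only to fix notation):

$$\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=1}^{40} r\big(s(t), a(t)\big) \right]$$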

Would Monte Carlo Tree Search (MCTS) be an appropriate method for my problem, in particular given the large size of $A$ and $S$ and the stochastic transition function?

I have already read a lot of papers about MCTS (e.g. on progressive widening and double progressive widening, which sound quite promising), but maybe someone can tell me about their experience applying MCTS to similar problems, or about other appropriate methods for this problem (with a large state/action space and a stochastic transition function).

nbro
D. B.
  • What do you intend to use MCTS for exactly? – nbro Feb 15 '19 at 19:35
  • I would like to apply MCTS in the context of a stochastic optimization problem, which is why only 1 player exists in my problem. – D. B. Feb 18 '19 at 17:22
  • But which specific part of your problem would you like to solve with MCTS? – nbro Feb 19 '19 at 14:10
  • Do similar actions yield similar results? I.e., if I choose between two actions that appear similar, will they have very similar value in the long run? If so, it may be appropriate to embed the actions in some continuous space in the agent's policy, then perform an MCTS on the k-nearest neighbors (which are valid actions, of course) to the desired continuous proto-action. This is similar to this paper from DeepMind: https://arxiv.org/abs/1512.07679 – Andrew Sansom Mar 09 '21 at 03:29
  • I use MCTS for finding the max action all the time; at least the action it finds is somewhere near the optimal one. Brute-force random action selection is infeasible for problems with a large possibility space. – Dee Mar 11 '21 at 08:13
  • I'm not sure if there's a way to make MCTS work for this problem. But I would like to remark that $10^8$ states can probably be exhausted in a reasonable amount of time, and that your branching factor is quite close to the size of the state space, leading to short paths on average. So there might be something to exploit there. – Celelibi Jun 12 '21 at 21:24

2 Answers


MCTS is often said to be a good choice for problems with large branching factors... but the context where that sentiment comes from is that it originally became popular for playing the game of Go, as an alternative to older game-playing approaches such as alpha-beta pruning. The branching factor of Go is more like 250-300 though, which is often viewed as a large branching factor for board games. It's not such an impressive branching factor anymore when compared to your branching factor of $6,000,000$...

I don't see MCTS working well out of the box when you have 6 million choices at every step. Maybe it could do well if you have an extremely efficient implementation of your MDP (e.g. if you can simulate millions of roll-outs per second), and if you have a large amount of "thinking time" (probably in the order of hours or days) available.
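Just to make concrete what "out of the box" would involve here, below is a rough sketch of a single-player MCTS loop with double progressive widening (which you mention having read about). Everything in it, including the `step(s, a)` generative model, the toy sizes, and the widening constants, is a made-up placeholder for illustration rather than a reference implementation:

```python
import math
import random

# Placeholder sizes: the real problem has 6 million actions and ~10^8 states.
N_ACTIONS = 1_000
HORIZON = 40

def step(state, action):
    """Toy generative model: returns (next_state, reward).

    Stands in for the real stochastic transition function; each (state, action)
    pair has ~90 possible successors here, as in the question's comments.
    """
    noise = random.randrange(90)
    next_state = (31 * state + 7 * action + noise) % 10_000
    reward = ((state + action + noise) % 97) / 96.0
    return next_state, reward

class Node:                    # decision node (a state at some depth)
    def __init__(self):
        self.n = 0             # visit count
        self.children = {}     # action -> Chance

class Chance:                  # chance node (a state-action pair)
    def __init__(self):
        self.n = 0
        self.q = 0.0           # running mean of sampled returns
        self.outcomes = {}     # next_state -> [count, reward, Node]

def select_action(node, c_ucb=1.4, c_a=2.0, alpha=0.5):
    # Progressive widening over ACTIONS: only add a new action while the number
    # of expanded actions is below c_a * n^alpha.
    if len(node.children) < c_a * max(node.n, 1) ** alpha:
        a = random.randrange(N_ACTIONS)   # in practice: a prioritised proposal
        node.children.setdefault(a, Chance())
    # UCB1 over the expanded actions (returns should be normalised in practice).
    return max(node.children,
               key=lambda a: node.children[a].q
               + c_ucb * math.sqrt(math.log(node.n + 1) / (node.children[a].n + 1)))

def simulate(node, state, depth, c_s=1.0, beta=0.5):
    if depth == HORIZON:
        return 0.0
    node.n += 1
    a = select_action(node)
    ch = node.children[a]
    ch.n += 1
    # Progressive widening over OUTCOMES (the "double" part): only sample a new
    # successor while the number of stored successors is below c_s * n^beta.
    if len(ch.outcomes) < c_s * ch.n ** beta:
        s2, r = step(state, a)
        ch.outcomes.setdefault(s2, [0, r, Node()])
    else:
        # Re-use an already sampled successor, proportionally to its count.
        states = list(ch.outcomes)
        s2 = random.choices(states, weights=[ch.outcomes[s][0] for s in states])[0]
    entry = ch.outcomes[s2]
    entry[0] += 1
    ret = entry[1] + simulate(entry[2], s2, depth + 1)
    ch.q += (ret - ch.q) / ch.n           # incremental mean update
    return ret

if __name__ == "__main__":
    root = Node()
    for _ in range(5_000):                # hopelessly few for 6 million actions
        simulate(root, state=0, depth=0)
    best = max(root.children, key=lambda a: root.children[a].q)
    print("best root action:", best)
```

Even in this stripped-down form, each of your 40 decisions would need a very large number of such simulations to say anything meaningful about 6 million actions, which is where the efficiency and "thinking time" requirements above come from.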


To have any chance of doing better with such a massive branching factor, you really need generalization across actions. Are your 6 million actions really all entirely different actions? Or are many of them somehow related to each other? If you gather some experience (a simulation in MCTS, or just a trajectory with Reinforcement Learning approaches), can you generalize the outcome to other actions for which you did not yet collect experience?

If there is some way of treating different actions as being "similar" (in a given state), you can use a single observation to update statistics for multiple different actions at once. The most obvious way would be if you can define meaningful features for actions (or state-action pairs). Standard Reinforcement Learning approaches (with function approximation, maybe linear or maybe Deep Neural Networks) can then relatively "easily" generalize in a meaningful way across lots of actions. They can also be combined with MCTS in various ways (see for example AlphaGo Zero / Alpha Zero).
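As a purely illustrative sketch of what this can look like in its simplest form (the feature matrix, the linear value model, and every name below are assumptions made up for the example, not something given in the question): if each action can be described by a modest feature vector, even a linear score allows you to pre-select a few hundred candidate actions per state, and only those candidates are then expanded by MCTS or considered by an RL policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder: pretend each action is described by a 16-dimensional feature
# vector (the real problem has 6 million actions; a smaller stand-in is used
# here so the example runs quickly).
N_ACTIONS, N_FEATURES = 600_000, 16
action_features = rng.normal(size=(N_ACTIONS, N_FEATURES)).astype(np.float32)

def candidate_actions(state_weights, k=256):
    """Return the k actions with the highest approximate value.

    q(s, a) is approximated as phi(a) . w(s); in a real system w(s) would come
    from a learned model conditioned on the state rather than being random.
    Only these k candidates are then expanded in the search, instead of
    branching over all actions.
    """
    scores = action_features @ state_weights        # one score per action
    top_k = np.argpartition(scores, -k)[-k:]        # O(N) partial selection
    return top_k[np.argsort(scores[top_k])[::-1]]   # sort the k best

# Usage with a made-up state representation:
state_weights = rng.normal(size=N_FEATURES).astype(np.float32)
print(candidate_actions(state_weights, k=10))
```

In a real system the per-state weights would of course come from a learned model, and the candidate set could instead be found by (approximate) nearest-neighbour search in an action-embedding space, along the lines of the DeepMind paper linked in the comments above.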

Even with all that, a branching factor of 6 million still remains massive... but generalization across actions is probably your best bet (which may be done inside MCTS, but really does need a significant number of bells and whistles on top of the standard approach).

Dennis Soemers

The question is whether MCTS is an appropriate method given these conditions.

  • Action $a(t) \in A(t)$

  • State $s(t) \in S(t)$

  • Transition $(s(t), a(t)) \Rightarrow s(t+1)$

  • Horizon $t \in \{1, \dots, 40\}$

  • $A(t) = A$ with $|A| = 6 \times 10^6$

  • $S(t) = S$ with $|S| = 10^8$

  • The transition function is stochastic

MCTS focuses on the most promising sub-trees and is therefore asymmetric and well suited for some systems with a high branching factor, but the set sizes can be prohibitive unless some symmetries are exploited.
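For reference, the selection rule used by the standard UCT variant of MCTS, which is what produces this asymmetric growth, is

$$a^{*} = \arg\max_{a} \left( \bar{Q}(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right),$$

where $\bar{Q}(s, a)$ is the mean return observed so far after taking action $a$ in state $s$, $N(s)$ and $N(s, a)$ are visit counts, and $c$ trades off exploration against exploitation. With $6 \times 10^6$ candidate actions per node, even visiting every action once is infeasible, which is why the widening or symmetry exploitation mentioned above is needed.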

The original MCTS algorithm does not converge quickly. This is why alpha–beta pruning and other search strategies that focus on minimizing the search space are often used, but they too have limitations with regard to branching breadth. Pruning heuristics can help reduce the combinatorial explosion, but they may cause the most advantageous actions to be missed.

Some variants of MCTS, such as the one AlphaZero uses to learn Go from only the game rules, achieve impressive search speed, but perhaps not to the degree needed in this case without parallelism in the form of a computing cluster or significant hardware acceleration.

An excellent characteristic of MCTS is that highly promising actions found early may be selected without exhausting all the sub-tree evaluations, but again, that may not be enough.

A possibly constraining consideration is whether the Markov property will be upheld, how, and whether it should be. Another consideration is whether the system involves an opposing intelligent participant. All sampling or pruning search strategies involve the risk of not reliably identifying a single branch in a sub-tree that leads readily to an irreversible (or difficult-to-reverse) loss.

There are excellent papers that discuss these considerations and report a good deal of experience in the form of research results. The computational challenge of tree searches with high branching factors is covered in the literature on approximate solvers, an area of intense research.

Douglas Daseeco
  • Thank you for the detailed answer and the suggestions. But does a stochastic transition function really imply the need for a POMDP? In my case every state is fully observable (nevertheless the transitions are stochastic), so I think an MDP is adequate. Are there any suggestions for coping with stochastic transition functions (in the context of MDPs & MCTS)? From what I've read, the stochastic transition is taken into account by the generative model, but is that really all? I doubt that in my case, because every state-action pair has (due to stochasticity) 90 potential subsequent states – D. B. Feb 06 '19 at 14:43