Can we efficiently solve a problem with MCTS in which:
- the set of valid actions changes over time
- the environment is stochastic
- the time horizon is infinite?
To be more specific, I'm attempting to optimize order batching and pick planning in a warehouse by formulating it as an MDP and searching for a good enough plan with MCTS. Within this environment (a rough code sketch follows the list):
- The current state is fully observable (which items need to be picked, where they are located, and where the workers are)
- The set of possible actions depends on which items still need to be picked
- The state of the warehouse changes stochastically over time (new orders arrive)
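For concreteness, here is a minimal sketch of the model I have in mind; all of the names (`Item`, `Worker`, `WarehouseState`, `legal_actions`, `apply_pick`, `sample_arrivals`) and the Manhattan-distance cost are simplifications of my own, not from any library:

```python
# Placeholder model sketch; all names and dynamics are my own simplifications.
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: int
    location: tuple  # (aisle, shelf) grid coordinates


@dataclass(frozen=True)
class Worker:
    worker_id: int
    location: tuple


@dataclass(frozen=True)
class WarehouseState:
    pending_items: frozenset  # Items that still need to be picked
    worker: Worker            # one worker, to keep the sketch small


def legal_actions(state):
    """The action set is just the current pending pick list,
    so it changes whenever new orders arrive."""
    return sorted(state.pending_items, key=lambda i: i.item_id)


def apply_pick(state, item):
    """Deterministic part of the transition: walk to the item and pick it.
    The step cost is the Manhattan travel distance."""
    (wx, wy), (ix, iy) = state.worker.location, item.location
    cost = abs(wx - ix) + abs(wy - iy)
    next_state = WarehouseState(
        pending_items=state.pending_items - {item},
        worker=Worker(state.worker.worker_id, item.location),
    )
    return next_state, cost


def sample_arrivals(state, arrival_prob, rng):
    """Stochastic part of the transition: a new order may arrive
    between picks."""
    if rng.random() < arrival_prob:
        new_item = Item(rng.randrange(10_000),
                        (rng.randrange(20), rng.randrange(20)))
        return WarehouseState(state.pending_items | {new_item}, state.worker)
    return state


# Example: one pick step followed by sampled arrivals.
rng = random.Random(0)
state = WarehouseState(
    pending_items=frozenset({Item(189, (3, 7)), Item(734, (12, 1))}),
    worker=Worker(0, (0, 0)),
)
state, cost = apply_pick(state, legal_actions(state)[0])
state = sample_arrivals(state, arrival_prob=0.3, rng=rng)
```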
MCTS appears to be a good fit for this problem because the search space has a large branching factor and because we can already simulate rollouts of a policy along with the stochastic evolution of the state.
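For example, one truncated rollout under a simple greedy default policy (built on the placeholder helpers sketched above, with the infinite horizon cut off and discounted purely for simulation) could look like:

```python
def simulate_rollout(state, horizon, arrival_prob, rng, gamma=0.95):
    """One Monte Carlo rollout of a greedy nearest-item default policy,
    using the placeholder helpers sketched above."""
    total_cost, discount = 0.0, 1.0
    for _ in range(horizon):  # truncate the infinite horizon for simulation
        actions = legal_actions(state)
        if not actions:
            break
        # Greedy default policy: walk to the nearest pending item.
        wx, wy = state.worker.location
        item = min(actions,
                   key=lambda i: abs(wx - i.location[0]) + abs(wy - i.location[1]))
        state, cost = apply_pick(state, item)              # deterministic pick
        state = sample_arrivals(state, arrival_prob, rng)  # stochastic arrivals
        total_cost += discount * cost
        discount *= gamma
    return total_cost
```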
This answer points towards an "open loop" formulation of the problem. However, a fixed sequence of actions (e.g. [<pick #189>, <pick #734>, <pick #334>, ...]) is not even guaranteed to stay valid, since between any two actions there is a chance for the state to change (new orders arriving).
How can I formulate this as MCTS? Is the tree necessarily going to need "chance nodes"? Can we encode the state as a belief state instead?
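For reference, this is roughly the chance-node shape I imagine: children hang off `(action, sampled outcome)` pairs, so a new arrival spawns a new branch instead of invalidating the plan. Again, this is only a sketch on top of the placeholder helpers above (`legal_actions`, `apply_pick`, `sample_arrivals`, `simulate_rollout`), not a working planner:

```python
import math


class Node:
    """Decision node; chance is handled by sampling arrivals at expansion,
    so each (action, sampled outcome) pair gets its own child."""
    def __init__(self, state):
        self.state = state
        self.visits = 0
        self.action_visits = {}  # action -> visit count
        self.action_cost = {}    # action -> summed sampled cost-to-go
        self.children = {}       # (action, outcome_key) -> Node


def outcome_key(state):
    # Identical sampled arrival outcomes share a child.
    return frozenset(i.item_id for i in state.pending_items)


def select_action(node, c=1.4):
    """UCB1 over actions (minimizing cost); untried actions go first."""
    actions = legal_actions(node.state)
    untried = [a for a in actions if node.action_visits.get(a, 0) == 0]
    if untried:
        return untried[0]

    def score(a):
        mean = node.action_cost[a] / node.action_visits[a]
        return -mean + c * math.sqrt(math.log(node.visits) / node.action_visits[a])

    return max(actions, key=score)


def simulate(node, depth, arrival_prob, rng, gamma=0.95):
    """One MCTS simulation from `node`; returns a sampled discounted cost-to-go."""
    if depth == 0 or not legal_actions(node.state):
        return 0.0
    action = select_action(node)
    next_state, step_cost = apply_pick(node.state, action)       # deterministic part
    next_state = sample_arrivals(next_state, arrival_prob, rng)  # chance event
    key = (action, outcome_key(next_state))
    if key not in node.children:
        node.children[key] = Node(next_state)  # expand on the sampled outcome
        tail = simulate_rollout(next_state, depth - 1, arrival_prob, rng, gamma)
    else:
        tail = simulate(node.children[key], depth - 1, arrival_prob, rng, gamma)
    cost = step_cost + gamma * tail
    node.visits += 1
    node.action_visits[action] = node.action_visits.get(action, 0) + 1
    node.action_cost[action] = node.action_cost.get(action, 0.0) + cost
    return cost
```

Keying children on the sampled outcome would keep the search closed-loop: a branch where a new order arrived is explored separately from one where it did not, and I would replan from the real state after every executed pick.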