
Planning problems were among the first problems studied at the dawn of AI (e.g. Shakey the robot). Graph search algorithms (e.g. A*) and planning algorithms (e.g. GraphPlan) can be very efficient at generating a plan. As for problem formulation, PDDL is the preferred language for planning problems. Although planning problems in most cases only have discrete states and actions, the PDDL+ extension also covers the continuous dimensions of a planning problem.
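
As a concrete baseline, here is a minimal sketch of what I mean by solving such a problem with search: a deterministic, fully observable, discrete problem is handed to A* together with its known transition model, and the plan comes out of pure search, with no learning. The grid instance and all names are just an invented illustration.

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """Minimal A*: `neighbors(state)` yields (next_state, step_cost) pairs
    of a fully known, deterministic transition model."""
    frontier = [(heuristic(start, goal), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return path  # the plan: a sequence of states from start to goal
        for nxt, step in neighbors(state):
            new_cost = cost + step
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(frontier,
                               (new_cost + heuristic(nxt, goal), new_cost, nxt, path + [nxt]))
    return None  # no plan exists

# Invented example: 5x5 grid with a small wall, Manhattan distance as the admissible heuristic
walls = {(1, 1), (1, 2), (1, 3)}
def grid_neighbors(s):
    x, y = s
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= nx < 5 and 0 <= ny < 5 and (nx, ny) not in walls:
            yield (nx, ny), 1

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
print(a_star((0, 0), (4, 4), grid_neighbors, manhattan))
```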

If a planning problem has non-deterministic state transitions, classical planning algorithms are not a well suited solution method; (some) reinforcement learning algorithms are considered well suited in this case. From this point of view, if a planning problem has state transition probabilities less than 1, classical planning methods (A*, GraphPlan, the FF planner, etc.) are not the right tool to solve it.
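
To make the non-deterministic case concrete, the following is a minimal sketch of tabular Q-learning on an invented two-state MDP (the 0.8 success probability and all names are my own illustration, not taken from any library): because a transition can fail, a fixed action sequence is not enough, and the algorithm instead learns expected values from sampled outcomes.

```python
import random

# Invented MDP: from state 0, action "go" reaches the goal (state 1, reward +1)
# with probability 0.8 and stays in state 0 otherwise; state 1 is absorbing.
def step(state, action):
    if state == 0 and action == "go":
        return (1, 1.0) if random.random() < 0.8 else (0, 0.0)
    return (state, 0.0)

actions = ["go", "stay"]
Q = {(s, a): 0.0 for s in (0, 1) for a in actions}
alpha, gamma, eps = 0.1, 0.95, 0.1

# Tabular Q-learning: model-free, learns from sampled transitions instead of
# being given the transition probabilities up front.
for _ in range(5000):
    s = 0
    for _ in range(20):
        a = random.choice(actions) if random.random() < eps else max(actions, key=lambda a: Q[(s, a)])
        s2, r = step(s, a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2
        if s == 1:
            break

print(Q)  # Q[(0, "go")] ends up clearly above Q[(0, "stay")]
```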

Looking at reinforcement learning examples, environments are sometimes used to showcase reinforcement learning algorithms even though they could very well be solved by search/planning algorithms: they are fully deterministic, fully observable, and sometimes they even have discrete action and state spaces.

Given an arbitrary fully deterministic planning problem, what is/are the characteristic(s) which make reinforcement learning, and not "classical planning" methods, better suited to solve the problem?

  • Note that "pure" reinforcement learning algorithms, like Q-learning, are not planning algorithms. In RL, planning algorithms are e.g. value iteration or policy iteration, which require a model of the environment to solve the planning problem. See [this answer](https://ai.stackexchange.com/a/12067/2444). So, you may want to clarify that in your post. – nbro Feb 08 '21 at 14:06
  • I inverted the statement to make it less general. Not all RL algorithms can solve planning problems, but non-deterministic planning can be solved by RL algorithms. – 50k4 Feb 08 '21 at 14:11
  • If by planning you mean finding a policy, yes, RL can be used for that. However, typically, in planning, you assume to have a model (i.e. the transition system or function), but not all RL algorithms use a model of the environment (so not all RL algorithms are "planning" algorithms in this sense), so I would clarify this: i.e. algorithms like Q-learning don't "plan" before executing, they execute (i.e. immediately take actions in the environment and observe the outcomes). I think you're referring to algorithms like **value iteration**. Anyway, this is an interesting question. – nbro Feb 08 '21 at 14:13
  • Planning is generating a sequence of decisions which leads to the goal. If I train a policy, I can use the policy to generate a plan, right? If I have a planning problem, I have a model. If I want to solve it with model-free RL, can I just use the model as an environment? It might be very inefficient, but it could be used that way, right? – 50k4 Feb 08 '21 at 14:18
  • Model-free algorithms, like Q-learning, don't use the transition function, i.e. the model, which is often not available. Q-learning is not considered a planning algorithm because it actually doesn't plan anything (i.e. it doesn't try to find the plan **before** executing the actions: it actually executes the actions, which can be catastrophic, and learns from trial-and-error, i.e. Q-learning is a learning algorithm, because it learns from experience). – nbro Feb 08 '21 at 14:21
  • That's why I'm saying that you should edit your post to clarify the difference between "RL algorithm" (like Q-learning) and "dynamic programming algorithm" (like **value iteration**). Both these algorithms are used to solve MDPs, i.e. find a policy, but they are not both planning algorithms. If you read [this answer](https://ai.stackexchange.com/a/12067/2444), you should understand better what I'm talking about. – nbro Feb 08 '21 at 14:21
  • I am not sure I agree with you, but that would be a different question... Maybe I will post that question as well. I am really interested, generally, in when to use RL vs. classical planning and/or control theory and/or classical optimization. I added the (some) qualifier to make sure to exclude the RL algorithms which would not be usable. – 50k4 Feb 08 '21 at 19:20
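
To illustrate the distinction raised in the comments between model-free RL (e.g. Q-learning) and planning/dynamic-programming methods, here is a minimal value iteration sketch on the same kind of invented two-state MDP as above: it consumes the full transition model up front and computes a policy before a single action is executed, which is what makes it a planning method in the sense nbro describes.

```python
# Invented 2-state, 2-action MDP, given as an explicit model:
# P[(s, a)] is a list of (probability, next_state, reward) triples.
states, actions, gamma = [0, 1], ["go", "stay"], 0.95
P = {
    (0, "go"):   [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (0, "stay"): [(1.0, 0, 0.0)],
    (1, "go"):   [(1.0, 1, 0.0)],
    (1, "stay"): [(1.0, 1, 0.0)],
}

# Value iteration: Bellman optimality backups over the known model,
# no interaction with the environment is needed.
V = {s: 0.0 for s in states}
for _ in range(100):  # enough sweeps to converge on this tiny problem
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]) for a in actions)
         for s in states}

policy = {s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
          for s in states}
print(V, policy)  # the greedy policy chooses "go" in state 0
```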

0 Answers