5

Most reinforcement learning agents are trained in simulated environments. The goal is usually to maximize performance in that same environment, preferably with a minimum number of interactions. Having a good model of the environment allows the agent to use planning, which can drastically improve sample efficiency!

Why isn't the simulation itself used for planning in these cases? It is a sampling model of the environment, right? Can't we try multiple actions in some or all states, follow the current policy to look several steps ahead, and finally choose the action with the best outcome? Shouldn't this let us find better actions more quickly than policy gradient updates alone?
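To make the idea concrete, here is a rough sketch of the lookahead I have in mind (the simulator interface with `copy()`/`step()` and the `policy` callable are purely illustrative assumptions, not any particular library's API):

```python
import numpy as np

def lookahead_action(sim, policy, candidate_actions, horizon=10, gamma=0.99, n_rollouts=5):
    """Pick the candidate action with the best average rollout return,
    using the simulator itself as the (sampling) model."""
    best_action, best_value = None, -np.inf
    for a in candidate_actions:
        returns = []
        for _ in range(n_rollouts):
            model = sim.copy()            # snapshot of the simulator at the current state (assumed capability)
            s, r, done = model.step(a)    # try the candidate action
            total, discount = r, gamma
            for _ in range(horizon - 1):
                if done:
                    break
                s, r, done = model.step(policy(s))  # follow the current policy afterwards
                total += discount * r
                discount *= gamma
            returns.append(total)
        value = np.mean(returns)
        if value > best_value:
            best_action, best_value = a, value
    return best_action
```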

In this case the environment and the model are essentially identical, and that seems to be the problem. Or is the good old curse of dimensionality to blame again? Please help me figure out what I'm missing.

Ray Walker
  • Hi and welcome to AI SE! What makes you think that simulated environments are not or cannot be used for planning? – nbro Apr 09 '20 at 13:06
  • 2
    Hi @nbro! Thank you for the warm comment and the helpful question, which I'm happy to answer: afaiu, the most popular deep RL algorithms are model-free. These are sample inefficient as they learn from pure interactions with the environment. An agent with a good model of the environment could use the model to plan ahead. This way the agent would get experience without interacting with the environment and thus increase the sample efficiency. When this is true, why are popular algorithms like PPO still very sample inefficient? I think I'm missing something. Please help : ) – Ray Walker Apr 09 '20 at 16:56

3 Answers

1

I will give one perspective on this from the domain of robotics. You are right that most RL agents are trained in simulation, particularly for research papers, because it allows researchers, at least in theory, to benchmark their approaches in a common environment. Many of the environments exist strictly as a test bed for new algorithms and are not even physically realizable, e.g. HalfCheetah. You could in theory run a separate copy of the simulator, say in another process, as your planning model, with the "real" simulator serving as your environment. But that's really just a mock setup for what you ultimately want: a real-world agent in a real-world environment.

What you describe could be very useful, with one important caveat: the simulator needs to in fact be a good model of the real environment. For robotics and many other interesting domains, this is a tall order. Getting a physics simulator that faithfully replicates the real-world environment can be tricky, as one may need accurate friction coefficients, mass and center of mass, restitution coefficients, material properties, contact models, and so on. Oftentimes the simulator is too crude an approximation of the real-world environment to be useful as a planner.

That doesn't mean we're completely hosed, though. This paper uses highly parallelized simulators to search for simulation parameters that approximate the real world well. What's interesting is that it doesn't necessarily find the correct real-world values for, e.g., friction coefficients; rather, it finds parameter values that, taken together, produce simulations that match the real-world experience. The better the simulation gets at approximating what's going on in the real world, the more viable it is to use the simulator for task planning. I think with the advent of GPU-optimized physics simulators we will see simulators become a more useful tool even for real-world agents, as you can try many different things in parallel to get a sense of the likely outcome of a planned action sequence.
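As a rough illustration of that idea (not the paper's actual method), such a parameter search can be as simple as sampling candidate simulator parameters and keeping whichever set best reproduces logged real-world trajectories; the `make_sim` factory, its interface, and the parameter ranges below are hypothetical:

```python
import numpy as np

def fit_sim_parameters(make_sim, real_trajectories, n_candidates=1000, seed=0):
    """Random search for simulator parameters whose rollouts best match
    real-world trajectories (a crude stand-in for the parallelized search
    described above)."""
    rng = np.random.default_rng(seed)
    best_params, best_error = None, np.inf
    for _ in range(n_candidates):
        params = {
            "friction": rng.uniform(0.1, 1.0),   # assumed plausible ranges
            "mass":     rng.uniform(0.5, 5.0),
        }
        sim = make_sim(**params)
        error = 0.0
        for states, actions in real_trajectories:  # each: (state sequence, action sequence)
            sim.reset(states[0])
            for t, a in enumerate(actions):
                pred = sim.step(a)                  # predicted next state
                error += np.sum((pred - states[t + 1]) ** 2)
        if error < best_error:
            best_params, best_error = params, error
    return best_params
```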

adamconkey
0

Shouldn't this allow us to find better actions more quickly compared to policy gradient updates?

It depends on the nature of the simulation. If the simulation models a car as a rigid body moving with three degrees of freedom $(x, y, \theta)$ in a plane (assuming it doesn't hit anything and get launched vertically), the three ordinary differential equations of rigid-body motion can be solved quite quickly. Compare that with a simulation of the path of least resistance of a ship on a wavy sea, where fluid dynamics equations must be solved, which requires a huge amount of resources. Yes, the response time required for a ship is much longer than for a car, but computing it predictively still needs a huge amount of computational power.
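For instance, a minimal planar car model of that kind (a kinematic sketch, not a full dynamics simulation) can be stepped thousands of times per second on ordinary hardware:

```python
import math

def step_car(x, y, theta, v, omega, dt=0.01):
    """One Euler step of a planar (x, y, theta) car model driven by
    forward speed v and yaw rate omega -- three cheap ODEs."""
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += omega * dt
    return x, y, theta

# Rolling out a few seconds of motion costs only a few hundred arithmetic
# operations, whereas a fluid-dynamics solver for a ship would have to update
# an entire 3D grid of cells at every time step.
state = (0.0, 0.0, 0.0)
for _ in range(500):          # 5 seconds of motion at dt = 0.01
    state = step_car(*state, v=1.0, omega=0.2)
```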

tmaric
0

The issue is generalizability. I largely agree with you, but ideally the learned policy will generalize to more complex environments the model hasn't seen. You could also run a planner on each new scenario, but that would be too computationally demanding for real-time use.
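To put rough numbers on that (purely illustrative): a learned policy needs one forward pass per control step, while online lookahead multiplies that by the number of candidate actions, rollouts, and the planning horizon:

```python
# Illustrative cost comparison; the counts are made up for scale only.
policy_cost = 1                                   # forward passes per decision
candidates, rollouts, horizon = 10, 5, 20
planner_cost = candidates * rollouts * horizon    # simulator/policy calls per decision
print(planner_cost / policy_cost)                 # -> 1000x more computation per action
```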

FourierFlux