I am trying to set up an experiment where an agent is exploring an n x n gridworld environment, of which the agent can see some fraction at any given time step. I'd like the agent to build up some internal model of this gridworld.
Now the environment is time-varying, so I figured it would be useful to try an LSTM, so the agent can learn potentially useful information about how the environment changes. However, since the agent can only see part of the environment, each observation used to train this model would be incomplete (i.e. the problem is partially observable from this perspective). I therefore imagine that training such a network would be difficult, since there would be large gaps in the data - for example, the agent may observe position [0, 0] at t = 0, and then not observe it again until, say, t = 100.
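To make the setup concrete, here is a minimal sketch of what I have in mind (all the specifics - grid size, view window, movement policy, hidden size - are just placeholder choices for illustration): the agent extracts a small local patch of the grid plus its position each step, and that partial-observation sequence is fed to an LSTM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n = 8        # grid size (placeholder)
view = 3     # agent sees a 3x3 window centered on itself
obs_dim = view * view + 2   # flattened local patch + normalized (row, col)

# Toy time-varying environment: grid values drift a little each step.
grid = torch.rand(n, n)

def observe(grid, r, c, view):
    """Extract the view x view patch around (r, c), zero-padded at the edges."""
    pad = view // 2
    padded = F.pad(grid, (pad, pad, pad, pad))
    patch = padded[r:r + view, c:c + view]
    return torch.cat([patch.flatten(), torch.tensor([r / n, c / n])])

lstm = nn.LSTM(input_size=obs_dim, hidden_size=32, batch_first=True)

# Roll out a short trajectory; each step yields only a partial observation.
T = 5
obs_seq = []
r, c = 0, 0
for t in range(T):
    obs_seq.append(observe(grid, r, c, view))
    grid = grid + 0.01 * torch.randn(n, n)          # environment changes over time
    r, c = min(r + 1, n - 1), min(c + 1, n - 1)     # agent wanders diagonally

x = torch.stack(obs_seq).unsqueeze(0)   # shape (1, T, obs_dim)
out, (h, _) = lstm(x)                   # out: (1, T, 32) hidden-state summary
```

The hope is that the hidden state summarizes the history of partial views, but since most cells go unobserved for long stretches, I'm not sure how well this can work.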
My question is twofold:
- Is there a canonical way of working around partial observability in LSTMs? Either direct advice or pointing to useful papers would both be appreciated.
- Can an LSTM account for gaps in time between observations?
Thanks!