I am trying to set up an experiment where an agent is exploring an n x n gridworld environment, of which the agent can see some fraction at any given time step. I'd like the agent to build up some internal model of this gridworld.
Now the environment is time-varying, so I figured it would be useful to try an LSTM, so the agent can learn potentially useful information about how the environment changes. However, since the agent can only see part of the environment, each observation used to train this model would be incomplete (i.e. the problem is partially observable from this perspective). I therefore imagine that training such a network would be difficult, since there would be large gaps in the data - for example, the agent may observe position [0, 0] at t = 0, and then not observe it again until, say, t = 100.
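To make the setup concrete, here is a minimal sketch of what I have in mind (all the specifics - grid size, view window, movement policy, hidden size - are just placeholder choices for illustration): the agent extracts a small local patch of the grid plus its position each step, and that partial-observation sequence is fed to an LSTM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n = 8        # grid size (placeholder)
view = 3     # agent sees a 3x3 window centered on itself
obs_dim = view * view + 2   # flattened local patch + normalized (row, col)

# Toy time-varying environment: grid values drift a little each step.
grid = torch.rand(n, n)

def observe(grid, r, c, view):
    """Extract the view x view patch around (r, c), zero-padded at the edges."""
    pad = view // 2
    padded = F.pad(grid, (pad, pad, pad, pad))
    patch = padded[r:r + view, c:c + view]
    return torch.cat([patch.flatten(), torch.tensor([r / n, c / n])])

lstm = nn.LSTM(input_size=obs_dim, hidden_size=32, batch_first=True)

# Roll out a short trajectory; each step yields only a partial observation.
T = 5
obs_seq = []
r, c = 0, 0
for t in range(T):
    obs_seq.append(observe(grid, r, c, view))
    grid = grid + 0.01 * torch.randn(n, n)          # environment changes over time
    r, c = min(r + 1, n - 1), min(c + 1, n - 1)     # agent wanders diagonally

x = torch.stack(obs_seq).unsqueeze(0)   # shape (1, T, obs_dim)
out, (h, _) = lstm(x)                   # out: (1, T, 32) hidden-state summary
```

The hope is that the hidden state summarizes the history of partial views, but since most cells go unobserved for long stretches, I'm not sure how well this can work.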
My question is twofold:
- Is there a canonical way of working around partial observability in LSTMs? Either direct advice or pointing to useful papers would both be appreciated.
- Can an LSTM account for gaps in time between observations?
Thanks!