
In deep RL techniques, if I understand correctly, a replay buffer is used while training the neural networks. The purpose of the replay buffer is to store experience and to feed a (randomly sampled) batch of unit transitions to the networks, since it is known that neural networks work well on i.i.d. data.

But in games, the experience trajectory is important because it contains the temporal dynamics. Is that correct? If not, then all the knowledge required to learn the policy function can be obtained from (out-of-sequence, randomly sampled) unit transitions alone.

Which of the two is correct?

Note that a unit transition in this question refers to $(s_t, a_t, r_t, s_{t+1})$.
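
For concreteness, here is a minimal sketch of what I mean by a replay buffer of unit transitions; the names `ReplayBuffer`, `capacity`, and `batch_size` are just illustrative, not taken from any particular library:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        # Oldest transitions are discarded once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        # Each entry is a single unit transition (s_t, a_t, r_t, s_{t+1});
        # the trajectory it came from is not recorded.
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions, giving approximately i.i.d. training data.
        return random.sample(self.buffer, batch_size)
```

Each stored entry is an independent $(s_t, a_t, r_t, s_{t+1})$ tuple, and uniform sampling deliberately discards the ordering within a trajectory.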

hanugm
  • This is a confusing question. Please make it clear what your two options are. They are also not currently contradictory - a trajectory can be important and still be broken down into parts/reassembled etc. So you need to make it clearer what your concern is about needing to preserve a trajectory as an irreducible unit. Given that many of the earliest papers on deep RL solve games using a database of single time-step transitions, which I believe you have read (?), and it is pretty clear that the approach works, I would need to be very clear on what your concern is before I could answer – Neil Slater Jul 04 '22 at 15:56
  • **But in games, the experience trajectory is important because it contains the temporal dynamics. Is that correct?** I don't know whether or not it is true that the entire trajectory should be taken into account to calculate the policy function. I am viewing the first option as a sentence and the second option as the collection of all its bigrams. That may be the wrong way of viewing it in this case. @NeilSlater – hanugm Jul 05 '22 at 04:55
  • Thanks for trying to clarify it. I still don't understand what your problem is - I just have a vague idea that you think whole trajectories are irreducible for some RL problems, but no clear understanding of why you think that. This is probably something fundamental in RL that you misunderstand, since it is reasonable to claim that all MDPs contain "temporal dynamics", and single-step temporal difference learning can be applied to any strict MDP (plus it is often a superior approach to Monte Carlo methods in terms of sample efficiency). – Neil Slater Jul 05 '22 at 07:10
  • @NeilSlater, okay, I understand that the policy can be learned either from whole trajectories or from unit transitions. I think some quantities, like expected returns, are calculated from trajectories alone, and since I am used to definitions that are generally given in terms of expected returns, I was thinking that the trajectory is mandatory. And the transition function of the environment can also be obtained from a collection of unit transitions... – hanugm Jul 06 '22 at 17:55
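
As a concrete illustration of the single-step temporal difference learning mentioned in the comments above, here is a minimal sketch of a tabular one-step Q-learning update; `make_q_table`, `alpha`, and `gamma` are assumed, hypothetical names, and the tabular setting is a simplification rather than anything stated in the question:

```python
from collections import defaultdict

# Tabular action-value table: Q[s][a] defaults to 0.0 for unseen state-action pairs.
def make_q_table(actions):
    return defaultdict(lambda: {a: 0.0 for a in actions})

def one_step_q_update(Q, transition, alpha=0.1, gamma=0.99):
    # Unpack a single unit transition (s_t, a_t, r_t, s_{t+1}).
    # A `done` flag for terminal states is omitted for brevity.
    s, a, r, s_next = transition
    # Bootstrapped one-step TD target: immediate reward plus the discounted
    # current estimate of the best action value in the next state.
    td_target = r + gamma * max(Q[s_next].values())
    # Move Q(s, a) a small step toward the target; no other part of the
    # trajectory is needed for this update.
    Q[s][a] += alpha * (td_target - Q[s][a])

# Usage: a single sampled transition is enough to perform an update.
Q = make_q_table(actions=[0, 1])
one_step_q_update(Q, transition=("s0", 1, 0.5, "s1"))
```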

0 Answers