
I'm quite new to reinforcement learning, and I'm working on a communication problem with a large continuous action range for my final graduation project. I'm trying to use a Gaussian policy with policy gradient methods for the implementation.

To explain the sequential logic of the task: from the current observation (o_t), the agent takes actions (a_t) and the environment reaches a new state (s_{t+1}). This new state is a function of the previous observation (o_t) and the actions (a_t) taken by the agent. However, this next state (s_{t+1}) is not the next observation the agent will use to choose its next actions; it is only used to compute the reward for that iteration during policy optimization.

My question is whether RL can be applied to this type of problem, where the state reached by the agent's actions at a given time step is not the next observation the agent uses to choose its next actions.
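To make the loop concrete, here is a minimal sketch of the kind of interaction I mean. The transition, reward, and observation functions below are just placeholders (not my actual system); the point is only that s_{t+1} feeds the reward while the next observation comes from elsewhere:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        # Gaussian policy over continuous actions
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

# --- placeholder environment pieces, purely illustrative ---
def transition(obs, action):
    # s_{t+1} = f(o_t, a_t): only used to compute the reward below
    return obs + 0.1 * action.sum()

def reward_fn(state):
    return -float(state.pow(2).sum())

def next_observation():
    # o_{t+1} comes from a separate process, NOT from s_{t+1}
    return torch.randn(4)
# ------------------------------------------------------------

policy = GaussianPolicy(obs_dim=4, act_dim=2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

obs = torch.randn(4)                      # o_t
log_probs, rewards = [], []
for t in range(50):
    dist = policy(obs)
    action = dist.sample()                # a_t ~ pi(. | o_t)
    log_probs.append(dist.log_prob(action).sum())
    state = transition(obs, action)       # s_{t+1}, used only for the reward
    rewards.append(reward_fn(state))
    obs = next_observation()              # the agent conditions on this, not on s_{t+1}

# REINFORCE update: maximize sum_t log pi(a_t | o_t) * G_t
returns, G = [], 0.0
for r in reversed(rewards):
    G = r + 0.99 * G
    returns.insert(0, G)
returns = torch.tensor(returns)
loss = -(torch.stack(log_probs) * returns).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```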

  • Not an expert on RL myself yet, but it seems you are describing a POMDP = Partially Observable MDP (as opposed to a vanilla MDP) which is used a lot in RL. So yes :) – Felix Goldberg Jun 25 '23 at 11:00

1 Answer


Yes. There are RL algorithms (Dreamer v1, v2, v3) that learn a world model to predict the next latent state h_{t+1} from the sampled action and the previous latent state h_t. The actor-critic is then trained on rollouts of this simulated world model, which makes it much more data-efficient. https://arxiv.org/abs/2301.04104v1
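Very roughly, the idea looks like the sketch below: learn a model that maps (h_t, a_t) to a predicted h_{t+1} and reward, then train the actor-critic on imagined rollouts of that model. This is a deliberately simplified, deterministic stand-in; the real Dreamer uses a recurrent stochastic state-space model with observation reconstruction and more.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    def __init__(self, latent_dim, act_dim):
        super().__init__()
        # predicts the next latent state from (h_t, a_t)
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + act_dim, 128), nn.ELU(),
            nn.Linear(128, latent_dim))
        # predicts the reward from the predicted next latent state
        self.reward_head = nn.Linear(latent_dim, 1)

    def forward(self, h_t, a_t):
        h_next = self.dynamics(torch.cat([h_t, a_t], dim=-1))  # predicted h_{t+1}
        r_pred = self.reward_head(h_next)                       # predicted reward
        return h_next, r_pred

# Train this model on logged real transitions, then roll it out to generate
# "imagined" trajectories for the actor-critic instead of querying the real
# environment at every step.
model = LatentWorldModel(latent_dim=8, act_dim=2)
h, a = torch.randn(8), torch.randn(2)
h_next, r_pred = model(h, a)
```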

ipoppo
    You can use mathjax on this site. – nbro Jan 31 '23 at 11:24