
I have set up an RL environment, and it converges to a decent solution when using the reward function:

$R(s_t, a_t) = f_{\text{env}}(s_t, a_t)$, where $f_{\text{env}}$ is the environment dynamics.

Now, I want to change the reward function such that

$R(s_t, a_t) = f_{\text{env}}(s_t, a_t) \cdot c(s_{t-10:t}),$

where $c(s_{t-10:t})$ is a penalty term that depends on the average performance over the previous 10 timesteps. The reward now depends on the agent's previous states, i.e. it needs information from the past. I suspect this changes the underlying problem, so that the MDP assumption is no longer valid.
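
To make the setup concrete, here is a minimal sketch of the modified reward computation. The names `f_env` and `penalty` are placeholders for my environment reward and the penalty term, not actual library calls:

```python
from collections import deque

# Sketch only: f_env(s, a) is the original environment-based reward term,
# penalty(history) is c(.), computed from the average performance over the
# last `window` states.
class HistoryPenaltyReward:
    def __init__(self, f_env, penalty, window=10):
        self.f_env = f_env
        self.penalty = penalty
        self.history = deque(maxlen=window)

    def __call__(self, s_t, a_t):
        self.history.append(s_t)
        # The reward now uses states other than (s_t, a_t); this is the part
        # that I suspect breaks the Markov assumption on the reward.
        return self.f_env(s_t, a_t) * self.penalty(list(self.history))
```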

I have read about potential-based reward shaping, which guarantees policy invariance with respect to the underlying MDP, but I am not sure whether this transformation falls into that category.
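
For reference, my understanding is that a potential-based shaping term has the additive form

$F(s_t, a_t, s_{t+1}) = \gamma \Phi(s_{t+1}) - \Phi(s_t)$

for some potential function $\Phi$ defined on individual states, whereas my term $c(s_{t-10:t})$ is multiplicative and depends on a window of past states, so it does not obviously fit that form.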

So my question is: by including information from past states in the reward definition, do we still maintain the underlying MDP, or are we solving a different problem altogether? Assume that we don't explicitly add previous action choices to the state representation, as suggested here: https://ai.stackexchange.com/a/25991/54470
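
For contrast, the workaround from the linked answer (which I would prefer to avoid) would look roughly like the sketch below: fold the 10-step window of past states into the observation, so that the reward again depends only on the current, augmented state. This assumes flat vector observations and an old gym-style `reset()`/`step()` interface, and is only meant to illustrate the idea:

```python
import numpy as np

# Illustrative only: keep the last `window` states inside the observation
# so that c(s_{t-10:t}) becomes a function of the (augmented) current state.
class StackedStateWrapper:
    def __init__(self, env, window=10):
        self.env = env
        self.window = window
        self.buffer = []

    def reset(self):
        s = self.env.reset()
        self.buffer = [s] * self.window            # pad history with the initial state
        return np.concatenate(self.buffer)         # augmented state (s_{t-9}, ..., s_t)

    def step(self, a):
        s_next, r, done, info = self.env.step(a)
        self.buffer = self.buffer[1:] + [s_next]   # slide the window one step
        return np.concatenate(self.buffer), r, done, info
```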

  • I believe this question is related and answers your concerns: https://ai.stackexchange.com/questions/25990/reinforcement-learning-algorithm-with-rewards-dependent-both-on-previous-action?rq=1 – postnubilaphoebus Oct 21 '22 at 16:07
  • Thanks @postnubilaphoebus! This confirms my suspicion about losing the Markovian property in this case. – StarDust_08 Oct 25 '22 at 09:13
  • You're not asking any question explicitly. Could you please edit your post to actually ask a question? Thanks. – nbro Dec 30 '22 at 15:36
  • My bad @nbro! Hope the question is clear now. Thanks for your time. – StarDust_08 Jan 03 '23 at 14:40

0 Answers