I have set up an RL environment, and it converges to a decent solution when using the reward function
$R(s_t, a_t) = f_{\text{env}}(s_t, a_t)$, where $f_{\text{env}}$ is the environment dynamics.
Now, I want to change the reward function such that
$R(s_t, a_t) = f_{\text{env}}(s_t, a_t) \cdot c(s_{t-10:t}),$
where $c(s_{t-10:t})$ is a penalty term that depends on the average performance over the previous 10 timesteps. The reward now depends on the agent's previous states, i.e. it needs information from the past. I suspect this changes the underlying problem, so that the MDP assumption no longer holds.
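To make this concrete, here is a minimal sketch of how the new reward is computed. The names `penalty`, `perf_history`, and `WINDOW`, and the exact form of the penalty, are placeholders; in my setup $c(\cdot)$ is some function of the average performance over the last 10 steps:

```python
from collections import deque
import numpy as np

WINDOW = 10                          # number of past timesteps the penalty looks at
perf_history = deque(maxlen=WINDOW)  # rolling buffer of recent per-step performance

def penalty(history):
    # Placeholder form of c(.): shrinks the reward when recent average performance is low.
    if len(history) < WINDOW:
        return 1.0                   # no penalty until the window is full
    return float(np.clip(np.mean(history), 0.0, 1.0))

def shaped_reward(s_t, a_t, f_env):
    # R(s_t, a_t) = f_env(s_t, a_t) * c(s_{t-10:t})
    base = f_env(s_t, a_t)           # original reward from the environment dynamics
    r = base * penalty(perf_history)
    perf_history.append(base)        # this step's performance feeds future penalties
    return r
```

The point is that `perf_history` is extra information that lives outside the current state $s_t$.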
I have read about potential-based reward shaping, which guarantees that the underlying MDP's optimal policy is preserved, but I am not sure whether this transformation falls into that category.
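For reference, the shaping I have read about adds a term derived from a potential function $\Phi$ over single states (Ng, Harada & Russell, 1999):

$$R'(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) + \gamma \Phi(s_{t+1}) - \Phi(s_t)$$

My change is multiplicative and depends on a window of past states rather than a potential of the current state, which is part of why I doubt it qualifies.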
So my question is: by including information from past states in the reward definition, do we still maintain the underlying MDP, or are we solving a different problem altogether? Assume that we do not explicitly add the previous action choices to the state representation, as suggested here: https://ai.stackexchange.com/a/25991/54470