
I've been trying to train an agent, and I've received and read suggestions to improve the speed at which it reaches the goal. The suggestion is to use a time penalty, for example, adding $-0.1$ to the reward at each time step.

However, at first glance, it seems weird to use it, because at the core of all RL algorithms we assume the environment is a Markov decision process, so the agent should be able to choose the best action based on the current observation of the environment alone.

If the agent is in the same state at different times and receives a different reward, wouldn't this somehow violate the MDP assumption, or at least prevent it from learning? After all, it has no time parameter with which to learn anything related to time (i.e., that this state yielded a worse expected return because it occurred at a later time step).
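
To make the suggestion concrete, here is what I understand it to mean (a minimal sketch assuming the Gymnasium API; the wrapper name and the CartPole environment are just placeholders for my own setup):

```python
import gymnasium as gym


class TimePenaltyWrapper(gym.RewardWrapper):
    """Add a constant per-step penalty to the environment's reward."""

    def __init__(self, env, penalty=-0.1):
        super().__init__(env)
        self.penalty = penalty

    def reward(self, reward):
        # The same constant is added at every time step.
        return reward + self.penalty


# Placeholder environment; in my case it would be my own custom environment.
env = TimePenaltyWrapper(gym.make("CartPole-v1"), penalty=-0.1)
```

Written this way, the penalty is just a constant added to every step's reward, so the reward function itself never takes the time index as an input, which is exactly why I don't see how the agent could learn anything time-related from it.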

  • By "MDP assumption", do you mean the "Markov assumption"? I had asked a related question [here](https://ai.stackexchange.com/q/24375/2444). [This post that I wrote later](https://nbro.gitlab.io/blogging/2020/11/01/optimal-value-function-of-shifted-rewards/) should clarify that there can be many reward functions that lead to the same optimal policy for continuing tasks. – nbro Jul 15 '23 at 22:01
  • Yes, by "MDP assumption" I mean the Markov assumption, because I think the agent would need more than the current observations to know the right action to take. My doubt arises because I read a post on Towards Data Science where they trained multiple pistons to move a ball from one end to the other, and they used the reward $(\Delta X / X_{e}) \cdot 100 + \tau t$, where $\tau$ is a time penalty of $-0.1$. To me it seems like the value function would change with time. – Andrea Carolina Mora Lopez Jul 21 '23 at 16:54
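
For reference, a minimal sketch of how the reward mentioned in the last comment might be computed; the variable names `delta_x`, `x_e`, and `t` are only placeholder labels for the symbols $\Delta X$, $X_{e}$, and $t$ in that formula:

```python
def pistonball_style_reward(delta_x, x_e, t, tau=-0.1):
    """Sketch of the reward (delta_x / x_e) * 100 + tau * t from the comment.

    delta_x: distance the ball moved this step
    x_e:     normalizing distance (the symbol X_e in the comment)
    t:       current time step
    tau:     per-step time penalty (-0.1 in the comment)
    """
    return (delta_x / x_e) * 100 + tau * t
```

Note that, as written in the comment, the $\tau t$ term takes the time step $t$ as an explicit input, which is what makes it look time-dependent.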

0 Answers