I've been training an agent, and I've received and read suggestions for improving how quickly it reaches the goal. The suggestion is to use a time penalty, for example adding $-0.1$ to the reward at each time step.
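For concreteness, this is roughly how I understand the suggested penalty would be implemented (a minimal sketch using a Gymnasium-style `RewardWrapper`; the environment name and the $0.1$ value are just placeholders, not my actual setup):

```python
import gymnasium as gym


class TimePenaltyWrapper(gym.RewardWrapper):
    """Subtract a fixed penalty from the environment reward at every step."""

    def __init__(self, env, penalty=0.1):
        super().__init__(env)
        self.penalty = penalty

    def reward(self, reward):
        # Same constant penalty on every transition, independent of how many
        # steps have already elapsed in the episode.
        return reward - self.penalty


# Hypothetical usage: wrap whatever environment the agent is trained on.
env = TimePenaltyWrapper(gym.make("FrozenLake-v1"), penalty=0.1)
```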
However, at first glance this seems strange: at the core of all RL algorithms we assume the environment is a Markov decision process (MDP), so the agent should be able to choose the best action based on the current observation of the environment alone.
If the agent is in the same state at different times and receives a different reward, wouldn't this somehow violate the MDP assumption, or at least prevent it from learning? After all, it has no time parameter with which to learn anything related to time (e.g. that this state yielded a worse expected return because it was visited at a later time step).
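To state my concern more concretely: with a finite episode length $T$, the return collected after visiting state $s$ at time step $t$ is $G_t = \sum_{k=t}^{T} r_k$, which seems to depend on $t$ through the number of remaining steps (and hence the remaining chance to reach the goal), yet the value the agent learns is conditioned on $s$ alone.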