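Based on the OpenAI Spinning Up description of Soft Actor-Critic (SAC), the soft Q-function is defined as

$$Q^{\pi}(s,a) = \underset{\tau \sim \pi}{\mathbb{E}}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) + \alpha \sum_{t=1}^{\infty} \gamma^t H\left(\pi(\cdot \mid s_t)\right) \,\middle|\, s_0 = s,\ a_0 = a \right]$$

and, as they say,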
"$Q^{\pi}$ is changed to include the entropy bonuses from every timestep except the first."
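I feel like it should make sense somehow, but they give no further explanation, and I don't see why it is correct, especially because the soft value function does include the first bonus term:

$$V^{\pi}(s) = \underset{\tau \sim \pi}{\mathbb{E}}\left[ \sum_{t=0}^{\infty} \gamma^t \Big( R(s_t, a_t, s_{t+1}) + \alpha H\left(\pi(\cdot \mid s_t)\right) \Big) \,\middle|\, s_0 = s \right]$$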
Could someone please explain this?