
According to the question How to deal with the time delay in reinforcement learning?, delays in reinforcement learning can be observation delays, action delays, or reward delays.

I have a special case of delay, but I am not sure what kind of delay it is or how to deal with it.

For example, at a state $S_{t_0}$, my agent takes action $A_1$, but we need to wait a while to receive the reward $R_1$. Meanwhile, my agent keeps taking actions $A_2$ and $A_3$. The tricky part is that $A_2$ and $A_3$ both influence the environment and may affect $R_1$.

So the timeline is: the agent plays actions $A_1$, $A_2$, and $A_3$, all of which take effect in the environment immediately, but we need to wait a while to observe the rewards $R_1$, $R_2$ and $R_3$.

Should we model this problem as an observation delay or a reward delay?

When my agent receives $R_1$ but not yet $R_2$ and $R_3$, can I update my Q-table using eligibility traces or some other method?

CharlesC
  • Are you in an episodic setting? If so, you could pretend that you get the sum of rewards as the terminal reward, i.e. $R_T = \sum_{k=0}^T R_k$. Then you can use any RL algorithm you want (rather than trying to assign partial credit to actions during the episode). – Raphael Lopez Kaufman Sep 12 '22 at 20:15
  • That is the tricky part: my setting is a continuing task, so there is no clear boundary like the end of an episode. – CharlesC Sep 13 '22 at 15:38
  • Instead of trying to find the "correct" mathematical approach, I wonder if you can get good results simply by applying a sufficient delay, i.e. you sum all rewards you got over a certain period of time T and make as if you received that summed reward at timestep $t+T$. Then you can play around with the value of $T$. – Raphael Lopez Kaufman Sep 14 '22 at 19:38
  • It might work. I was considering a similar approach: whenever I receive a reward, I use it to update all the (state, action) pairs in a waiting buffer, so each pair might be updated multiple times (a rough sketch of this idea follows the comments). For example, when I get $R_1$, I update $A_1$, $A_2$, and $A_3$ with the same value; when I get $R_2$, I update $A_2$ and $A_3$. But I am not sure this makes sense in theory. – CharlesC Sep 14 '22 at 22:36
  • Should I add my suggestion as an answer? – Raphael Lopez Kaufman Sep 16 '22 at 17:13
  • sure and thanks for the answer. we had a good discussion – CharlesC Sep 16 '22 at 18:21
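
A minimal sketch of the "waiting buffer" heuristic described in the comments, assuming a tabular Q agent in a continuing task. The class and method names (`WaitingBufferAgent`, `on_step`, `on_reward`) are hypothetical placeholders; this illustrates the heuristic the commenter describes, not a theoretically justified update rule.

```python
from collections import defaultdict

class WaitingBufferAgent:
    """Tabular agent that credits every pending (state, action) pair
    each time a delayed reward finally arrives."""

    def __init__(self, n_actions, alpha=0.1):
        self.Q = defaultdict(lambda: [0.0] * n_actions)
        self.alpha = alpha
        self.pending = []  # (state, action) pairs whose rewards are still outstanding

    def on_step(self, state, action):
        # Record every action taken while earlier rewards are still outstanding.
        self.pending.append((state, action))

    def on_reward(self, reward):
        # When a delayed reward arrives, update all pairs still in the buffer
        # with the same reward (e.g. R1 updates A1, A2, A3; R2 updates A2, A3).
        for s, a in self.pending:
            self.Q[s][a] += self.alpha * (reward - self.Q[s][a])
        # The oldest pair has now received its own reward and can be retired.
        self.pending.pop(0)
```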

1 Answer


Given that you are not in an episodic setting (based on the comments), I would suggest the following: accumulate rewards for a period of time $T$, that is, at time $t+T$ act as if you had received from the environment a single reward $R = \sum_{k=0}^{T-1} R_{t+k}$.

Then, treat $T$ as a hyperparameter and see which value gives the best agent.
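
A minimal sketch of this idea with tabular Q-learning in a continuing task. The environment interface (`env.reset`, `env.step`) and the window length `T` are assumptions for illustration; `T` is the hyperparameter to tune.

```python
from collections import defaultdict
import random

def q_learning_with_accumulated_reward(env, T=5, alpha=0.1, gamma=0.99,
                                        epsilon=0.1, n_actions=4, steps=10_000):
    Q = defaultdict(lambda: [0.0] * n_actions)
    state = env.reset()
    window_state, window_action = None, None  # (S, A) opening the current window
    accumulated = 0.0

    for t in range(steps):
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        if window_action is None:
            window_state, window_action = state, action

        next_state, reward = env.step(action)  # rewards may arrive with delay
        accumulated += reward                  # sum everything received in the window

        if (t + 1) % T == 0:
            # Act as if the summed reward were received now, at t + T,
            # and apply an ordinary one-step Q-learning update.
            td_target = accumulated + gamma * max(Q[next_state])
            Q[window_state][window_action] += alpha * (
                td_target - Q[window_state][window_action])
            accumulated = 0.0
            window_action = None

        state = next_state
    return Q
```

Sweeping over a few values of `T` and comparing the resulting agents is then just an ordinary hyperparameter search.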