
All the examples I see in TF-Agents for contextual bandits involve a reward function that generates the reward immediately after an observation is produced.

But in my real-world use case (say, sending emails and waiting for the click rate), the reward is only observed 3 days after the observation is generated. How do I include this scenario of delayed rewards when training the agent?

tjt
  • Note that questions about how to implement something in a specific library are off-topic here, so I recommend removing tf_agents from the title to remove the emphasis on the library. You could still ask, as an additional question, whether someone knows how to do it in that particular library, but that shouldn't be your main question. – nbro May 03 '22 at 08:50
  • @nbro sure done – tjt May 03 '22 at 16:05

1 Answer


The update rules are no different.

However, if you make many other decisions in the meantime, the timestamps for which you can run estimate updates will lag behind the current timestamp.

You will need a buffer of pending rewards, noting the state and action taken. You can clear the buffer, turning it into training data and running updates, once you have the matching rewards.
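
For example, a minimal sketch of such a buffer in plain Python (the names `PendingRewardBuffer`, `record`, `resolve` and `flush` are made up for illustration, not part of any particular library's API):

```python
from dataclasses import dataclass, field

@dataclass
class PendingRewardBuffer:
    """Holds (state, action) pairs until their delayed rewards arrive."""
    pending: dict = field(default_factory=dict)  # key -> (state, action)
    ready: list = field(default_factory=list)    # completed (state, action, reward) tuples

    def record(self, key, state, action):
        # Called at decision time, e.g. when the email is sent.
        self.pending[key] = (state, action)

    def resolve(self, key, reward):
        # Called when the delayed reward (e.g. the click rate after 3 days) is observed.
        state, action = self.pending.pop(key)
        self.ready.append((state, action, reward))

    def flush(self):
        # Hand back the completed records as training data and clear them.
        batch, self.ready = self.ready, []
        return batch
```

Each decision records its context and action under some key (e.g. an email id); when the delayed reward arrives it resolves that key, and `flush()` returns a batch that you feed to the normal update step.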

A lot of tutorial material uses the notation $Q_t(s,a)$ for the current estimate of expected reward, both in the update rules and to drive exploitation versus exploration. In a practical system with delays you will have to use the best available estimates instead. You don't need the subscript $t$ on the estimation function except for plotting learning curves; if you do plot them, you will need to decide whether to label the estimates by the decision time or by the update time when the estimates are revised (it doesn't really matter, the graph will look the same and mean the same thing, just with an offset).

Typically you will also plot total reward or regret, or some combination. This plot will be the same as before. The impact of the reward delay will be slower initial learning and slower responsiveness to non-stationary environments, but there is no way around that.

Once you ignore (or change) the subscript $t$ on $Q_t$, then all the equations work the same as before.
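
As a concrete sketch of that point (ignoring the context for brevity, and using made-up variable names rather than any library API), the standard incremental sample-average update can be applied unchanged to whatever completed records come out of the buffer, whenever they arrive:

```python
from collections import defaultdict

q_estimate = defaultdict(float)  # action -> current estimate of expected reward
n_resolved = defaultdict(int)    # action -> number of resolved rewards seen so far

def update_from_buffer(completed):
    """Apply the usual incremental update Q(a) <- Q(a) + (r - Q(a)) / N(a)
    to each delayed (state, action, reward) record as it resolves."""
    for _state, action, reward in completed:
        n_resolved[action] += 1
        q_estimate[action] += (reward - q_estimate[action]) / n_resolved[action]

# For example, once the 3-day-old rewards have arrived:
# update_from_buffer(buffer.flush())
```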

Neil Slater
  • Thank you. Regarding "You will need a buffer of pending rewards, ... once you have the matching rewards": how do I build a buffer of pending rewards? When predicting, the buffer has an associated reward and saves it in the trajectory (this is a dummy reward, since the actual reward will only be obtained after 3 days). After 3 days I want to update the reward in the saved trajectory to the observed reward. Is there a way to update the items in the saved trajectories? Can you please clarify this too if possible -> https://stackoverflow.com/questions/72090295/how-to-write-a-custom-policy-in-tf-agents – tjt May 02 '22 at 19:34
  • @tjt Store the partially complete records in a file or datastore until you receive the rewards and can send them to the update routine or script. How to do so will depend on the language, libraries and available infrastructure for your project. It's not an AI issue as such, but a code/design one. – Neil Slater May 02 '22 at 19:41
  • Sure, that's the path I had in mind initially, but I wondered if I can do it using custom agents and environments rather than using the replay buffer. But it looks like using the buffer is the only way – tjt May 03 '22 at 01:29