
I am new to reinforcement learning, but, for a finite-horizon application problem, I am considering using the average reward instead of the sum of rewards as the objective. Specifically, there are at most $T$ possible time steps (e.g., tracking the usage rate of an app in each time step), and in each time step the reward may be 0 or 1. The goal is to maximize the daily average usage rate.

The episode length $T$ is at most 10. $T$ is the maximum time window over which the product can observe a user's behavior in the chosen data. The data contains an indicator of whether an episode terminates. This is offline learning from logged data, so $T$ is given in the data for each episode. As long as an episode has not terminated, there is a reward in $\{0, 1\}$ at each time step.
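
To make the data concrete, each logged episode looks roughly like this (the field names are just my placeholders, not the real schema):

```python
# Rough sketch of one logged episode; field names are placeholders.
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Step:
    state: Any      # user/context features at this time step
    action: int     # action taken by the logging policy
    reward: int     # 0 or 1 (e.g., whether the app was used in this step)
    done: bool      # termination indicator from the data

Episode = List[Step]  # length T varies per episode, at most 10
```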

I heard that if I use an average reward with a finite horizon, the optimal policy is no longer a stationary policy, and the optimal $Q$-function depends on time. I am wondering why this is the case.
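
If that is true, I suppose the $Q$-table would have to be indexed by the time step as well. Here is a minimal sketch of what I imagine that would look like (tabular Q-learning with the time step folded into the state; the action count and hyperparameters are just placeholders, not values from my problem):

```python
# Minimal sketch: tabular Q-learning where Q is indexed by (state, t),
# so the learned values are allowed to depend on the time step.
import numpy as np
from collections import defaultdict

T_MAX = 10           # maximum episode length
N_ACTIONS = 2        # placeholder action count
ALPHA, GAMMA = 0.1, 1.0

Q = defaultdict(lambda: np.zeros(N_ACTIONS))

def q_update(state, t, action, reward, next_state, done):
    """One Q-learning backup on the time-augmented state (state, t)."""
    target = reward
    if not done and t + 1 < T_MAX:
        target += GAMMA * Q[(next_state, t + 1)].max()
    Q[(state, t)][action] += ALPHA * (target - Q[(state, t)][action])
```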

I see that, normally, the objective is defined as maximizing

$$\sum_{t=0}^T \gamma^t r_t$$
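
For reference, this is how I would compute that return for one episode's reward list (purely illustrative):

```python
# Illustrative helper: the (discounted) sum-of-rewards return for one episode,
# with rewards indexed from t = 0.
def discounted_return(rewards, gamma=1.0):
    return sum(gamma**t * r for t, r in enumerate(rewards))

discounted_return([1, 0, 1, 1], gamma=0.9)  # 1 + 0 + 0.81 + 0.729 = 2.539
```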

And I am considering two definitions of the average reward (a small sketch computing both follows the list).

  1. $\frac{1}{T}\sum_{t=0}^{T}\gamma^t r_t$, where $T$ varies in each episode.

  2. $\frac{1}{T-t}\sum_{i=t-1}^{T}\gamma^i r_i$
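
Here is a small sketch of how I would compute the two candidates for one episode (for definition 2, I average the remaining discounted rewards from step $t$ to the end of the episode, which may differ by one index from the formula above):

```python
# Sketch of the two candidate objectives for one episode's reward list.
def avg_reward_per_episode(rewards, gamma=1.0):
    """Definition 1: discounted sum over the whole episode divided by its length T."""
    T = len(rewards)
    return sum(gamma**t * r for t, r in enumerate(rewards)) / T

def avg_reward_from_step(rewards, t, gamma=1.0):
    """Definition 2 (approximately): remaining discounted rewards from step t on,
    divided by the number of remaining steps."""
    T = len(rewards)
    remaining = [gamma**i * rewards[i] for i in range(t, T)]
    return sum(remaining) / len(remaining)

avg_reward_per_episode([1, 0, 1, 1], gamma=1.0)     # 3/4 = 0.75
avg_reward_from_step([1, 0, 1, 1], t=2, gamma=1.0)  # (1 + 1)/2 = 1.0
```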

  • How does your horizon relate to episode length (or is the target environment continuous)? What controls value of $T$? Is the value of $T$ known at $t=0$? What do you mean by "iteration" - is that {0, 1} reward per time step, or is it {0, 1} reward per $t=0$ to $T$ pseudo-episode? – Neil Slater Aug 10 '20 at 09:58
  • @NeilSlater I added the edits – lll Aug 10 '20 at 18:55
  • Thanks for the edit. So, to be clear, $T$ is not actually the length of an "episode", but the size of an observation window into a continuous process? There is essentially no natural episode, just observations, and some of them form into small trajectories where state progresses due to actions taken? – Neil Slater Aug 10 '20 at 18:58
  • $T$ can be thought of as an episode, since after a maximum of 10 time steps users will go to another category of the app, which is not what we care about. And yes, it can be thought of as having no natural episode; state progresses due to actions taken – lll Aug 10 '20 at 20:44
  • For "I heard if I use an average reward for the finite horizon" I would be interested to know where you heard/read this from, as on the surface it looks incorrect to me for your problem (i.e. I agree with you). There may be some context to it that makes it true for some problems and not others, so I would like to see. Do you have a link? – Neil Slater Aug 11 '20 at 10:39
  • I asked a person doing RL research; she said "In finite (small) horizon, the optimal policy is no longer a stationary policy, and optimal Q Function depends on time." But I did not get the reason. – lll Aug 11 '20 at 19:35
  • I'd suspect she misunderstood the context of what you were asking about. Or you have misremembered. Seems wrong to me. Are you in a position to ask the researcher for clarification or a reason? – Neil Slater Aug 11 '20 at 20:50
  • She emailed me this sentence when I asked if it is OK to use the average reward as the objective for Q-learning. If you think it does not matter to use Q-learning in the average-reward setting, can you also kindly tell me why? – lll Aug 12 '20 at 00:56
  • I suggest you ask the researcher for clarification, otherwise it is not possible to fairly critique the statement. She may have had in mind something else based on your description, and that makes it a miscommunication. The statement "it does not matter to use q-learning in average reward setting" is missing too much context to answer for me, and possibly the researcher was faced with similar missing details and made an assumption about what you were doing. – Neil Slater Aug 12 '20 at 08:02

0 Answers