I'm reimplementing an RL paper about learning a job scheduling policy that acts so as to minimize average job completion time. They claim that this is an "input-driven" problem, i.e. much of the variance in rewards is due to the randomness in job arrival sequences, rather than the policy's actions. They deal with this by computing advantages using "input-dependent" baselines. During each training iteration, they have $N$ rollout workers running in parallel on the same job arrival sequence, and after all the rollouts are collected, time-based baselines are computed as follows:
$$b(t) = \frac{1}{N}\sum_{i=1}^N \hat{r}^i(t) \quad \forall t\geq0$$
where $\hat{r}^i(t)$ is a continuous interpolation (e.g. piecewise linear) of rollout $i$'s discounted returns $\{r^i_t\}_{t\in\mathcal{T}^i}$, evaluated at time $t$, and $\mathcal{T}^i \subset \mathbb{R}$ is the set of simulator wall-times of the steps in rollout $i$. Advantages are then computed for each rollout as
$$A^i_t = r^i_t - b(t) \quad \forall t\in\mathcal{T}^i, \forall i\in[N].$$
Since the baselines are computed as a function of only the current training iteration's job arrival sequence, they claim this removes the variance caused by the randomness of the arrival sequence from the policy gradient estimate. I think this is an interesting approach, and I haven't seen it elsewhere, though I admit I am very new to RL.
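For reference, here's roughly how I've implemented the baseline/advantage step (my own sketch, not the paper's code; I use `np.interp` for the piecewise-linear interpolation, and `wall_times` / `returns` are my own names for the per-rollout data):

```python
import numpy as np

def compute_advantages(wall_times, returns):
    # wall_times[i]: sorted 1-D array of simulator wall-times of rollout i's steps (T^i)
    # returns[i]:    1-D array of the discounted returns r^i_t observed at those steps
    N = len(wall_times)

    # Common time grid covering every step of every rollout.
    grid = np.unique(np.concatenate(wall_times))

    # \hat{r}^i(t): piecewise-linear interpolation of rollout i's returns on the grid.
    r_hat = np.stack([np.interp(grid, wall_times[i], returns[i]) for i in range(N)])

    # b(t) = (1/N) sum_i \hat{r}^i(t); all rollouts share the same job arrival sequence.
    baseline = r_hat.mean(axis=0)

    # A^i_t = r^i_t - b(t), evaluating the baseline at rollout i's own step times.
    return [returns[i] - np.interp(wall_times[i], grid, baseline) for i in range(N)]
```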
Once the advantages are computed, they perform a single REINFORCE gradient step on all the data collected from rollouts $i=1,\ldots,N$, i.e.
$$\theta \gets \theta + \alpha \sum_{i=1}^N \sum_{t\in \mathcal{T}^i} A^i_t\nabla_\theta \log \pi_\theta(a^i_t | s^i_t).$$
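In case it's useful, this is roughly how that update looks in my implementation (a PyTorch sketch under my own assumptions: a discrete action space, a softmax policy head, and `policy_net` / `states` / `actions` as my placeholder names):

```python
import torch

def reinforce_step(policy_net, optimizer, states, actions, advantages):
    # states[i], actions[i], advantages[i]: tensors for rollout i, aligned step-by-step.
    optimizer.zero_grad()
    loss = torch.zeros(())
    for s, a, adv in zip(states, actions, advantages):
        log_probs = torch.log_softmax(policy_net(s), dim=-1)     # (T_i, num_actions)
        chosen = log_probs.gather(1, a.unsqueeze(1)).squeeze(1)  # log pi_theta(a^i_t | s^i_t)
        # Negative of the objective above, so a descent step implements the ascent update.
        loss = loss - (adv * chosen).sum()
    loss.backward()
    optimizer.step()
```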
I am wondering: is PPO suitable for this problem? Could it work well with these advantage estimates in place of a critic network? If not, could a critic be set up to reduce variance in a similar way, i.e. conditioned only on the current job arrival sequence?
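To make the question concrete, here's roughly how I'd imagine plugging these input-dependent advantages into PPO's clipped surrogate in place of a critic (again just my own sketch; `old_log_probs` would be the log-probabilities recorded under the rollout policy):

```python
import torch

def ppo_clip_loss(policy_net, states, actions, advantages, old_log_probs, clip_eps=0.2):
    log_probs = torch.log_softmax(policy_net(states), dim=-1)            # (T, num_actions)
    new_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    ratio = torch.exp(new_log_probs - old_log_probs)                     # pi_theta / pi_old
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Clipped surrogate, with A^i_t coming from the time-based baseline
    # rather than from a learned value function.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

What I'm unsure about is whether it's sound to reuse these fixed advantages over multiple PPO epochs, given that they were computed under the rollout policy. Thanks!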