I'm reimplementing an RL paper about learning a job scheduling policy that acts so as to minimize average job completion time. They claim that this is an "input-driven" problem, i.e. much of the variance in rewards is due to the randomness in job arrival sequences, rather than the policy's actions. They deal with this by computing advantages using "input-dependent" baselines. During each training iteration, they have $N$ rollout workers running in parallel on the same job arrival sequence, and after all the rollouts are collected, time-based baselines are computed as follows:
$$b(t) = \frac{1}{N}\sum_{i=1}^N \hat{r}^i(t) \quad \forall t\geq0$$
where $\hat{r}^i(t)$ is a continuous interpolation (e.g. piecewise linear) of rollout $i$'s discounted returns $\{r^i_t\}_{t\in\mathcal{T}^i}$, evaluated at time $t$, and $\mathcal{T}^i \subset \mathbb{R}$ is the set of simulator wall-times of the steps in rollout $i$. Advantages are then computed for each rollout as
$$A^i_t = r^i_t - b(t) \quad \forall t\in\mathcal{T}^i, \forall i\in[N].$$
Since the baselines are computed as a function of only the current training iteration's job arrival sequence, they claim this removes the variance caused by the randomness of the arrival sequence from the policy gradient estimate. I think this is an interesting approach, and I haven't seen it elsewhere, though I admit I am very new to RL.
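For reference, here's roughly how I've implemented the baseline/advantage step (my own sketch, not the paper's code; I use `np.interp` for the piecewise-linear interpolation, and `wall_times` / `returns` are my own names for the per-rollout data):

```python
import numpy as np

def compute_advantages(wall_times, returns):
    # wall_times[i]: sorted 1-D array of simulator wall-times of rollout i's steps (T^i)
    # returns[i]:    1-D array of the discounted returns r^i_t observed at those steps
    N = len(wall_times)

    # Common time grid covering every step of every rollout.
    grid = np.unique(np.concatenate(wall_times))

    # \hat{r}^i(t): piecewise-linear interpolation of rollout i's returns on the grid.
    r_hat = np.stack([np.interp(grid, wall_times[i], returns[i]) for i in range(N)])

    # b(t) = (1/N) sum_i \hat{r}^i(t); all rollouts share the same job arrival sequence.
    baseline = r_hat.mean(axis=0)

    # A^i_t = r^i_t - b(t), evaluating the baseline at rollout i's own step times.
    return [returns[i] - np.interp(wall_times[i], grid, baseline) for i in range(N)]
```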
Once the advantages are computed, they perform a single REINFORCE gradient step on all the data collected from rollouts $i=1,\ldots,N$, i.e.
$$\theta \gets \theta + \alpha \sum_{i=1}^N \sum_{t\in \mathcal{T}^i} A^i_t\nabla_\theta \log \pi_\theta(a^i_t | s^i_t).$$
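In case it's useful, this is roughly how that update looks in my implementation (a PyTorch sketch under my own assumptions: a discrete action space, a softmax policy head, and `policy_net` / `states` / `actions` as my placeholder names):

```python
import torch

def reinforce_step(policy_net, optimizer, states, actions, advantages):
    # states[i], actions[i], advantages[i]: tensors for rollout i, aligned step-by-step.
    optimizer.zero_grad()
    loss = torch.zeros(())
    for s, a, adv in zip(states, actions, advantages):
        log_probs = torch.log_softmax(policy_net(s), dim=-1)     # (T_i, num_actions)
        chosen = log_probs.gather(1, a.unsqueeze(1)).squeeze(1)  # log pi_theta(a^i_t | s^i_t)
        # Negative of the objective above, so a descent step implements the ascent update.
        loss = loss - (adv * chosen).sum()
    loss.backward()
    optimizer.step()
```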
I am wondering: is PPO suitable for this problem? Could it work well with these advantage estimates in place of a critic network? If not, could a critic be set up to reduce variance in a similar way, i.e. conditioned only on the current job arrival sequence?
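To make the question concrete, here's roughly how I'd imagine plugging these input-dependent advantages into PPO's clipped surrogate in place of a critic (again just my own sketch; `old_log_probs` would be the log-probabilities recorded under the rollout policy):

```python
import torch

def ppo_clip_loss(policy_net, states, actions, advantages, old_log_probs, clip_eps=0.2):
    log_probs = torch.log_softmax(policy_net(states), dim=-1)            # (T, num_actions)
    new_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    ratio = torch.exp(new_log_probs - old_log_probs)                     # pi_theta / pi_old
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Clipped surrogate, with A^i_t coming from the time-based baseline
    # rather than from a learned value function.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

What I'm unsure about is whether it's sound to reuse these fixed advantages over multiple PPO epochs, given that they were computed under the rollout policy. Thanks!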