IQN Bellman target: using Z vs using Q

Question

The IQN paper (https://arxiv.org/abs/1806.06923) uses a distributional Bellman target: $$ \delta^{\tau,\tau'}_t = r_t + \gamma Z_{\tau'}(x_{t+1}, \pi_{\beta}(x_{t+1})) - Z_{\tau}(x_t, a_t) $$ and optimizes: $$ L = \frac{1}{N'} \sum^{N}_{i=1} \sum^{N'}_{j=1} \rho^\kappa_{\tau_i}\left(\delta^{\tau_i,\tau'_j}_t\right) $$

But similar quantiles can be obtained from the Q-values alone, by replacing the sampled target with its mean: $$ \begin{aligned} \delta^\tau_t &= r_t + \gamma \frac{1}{N'} \sum_{j=1}^{N'} Z_{\tau'_j}(x_{t+1}, \pi_{\beta}(x_{t+1})) - Z_\tau(x_t, a_t) \\ &= r_t + \gamma Q (x_{t+1}, \pi_\beta(x_{t+1})) - Z_\tau(x_t, a_t) \end{aligned} $$ and optimizing: $$ L = \sum^{N}_{i=1} \rho^{\kappa}_{\tau_i}\left(\delta^{\tau_i}_t\right) $$

Both lead to similar performance on the CartPole environment. The second loss function is simpler and more intuitive (at least to me). Is there an obvious reason why the authors didn't use it?
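For concreteness, here is a minimal NumPy sketch of the two losses for a single transition. It is not taken from the paper's code: the function names (`loss_z_target`, `loss_q_target`), the array shapes, and the use of precomputed quantile values in place of a network are all assumptions made for illustration. It takes $\rho^\kappa_\tau(\delta) = |\tau - \mathbb{I}\{\delta < 0\}| \, L_\kappa(\delta) / \kappa$ with $L_\kappa$ the Huber loss.

```python
import numpy as np

def huber(delta, kappa=1.0):
    """Huber loss L_kappa applied elementwise."""
    abs_d = np.abs(delta)
    return np.where(abs_d <= kappa, 0.5 * delta**2, kappa * (abs_d - 0.5 * kappa))

def quantile_huber(delta, tau, kappa=1.0):
    """rho^kappa_tau(delta) = |tau - 1{delta < 0}| * L_kappa(delta) / kappa."""
    indicator = (delta < 0.0).astype(float)
    return np.abs(tau - indicator) * huber(delta, kappa) / kappa

def loss_z_target(z_pred, taus, z_next, reward, gamma, kappa=1.0):
    """Distributional target: every pair (tau_i, tau'_j) gets its own TD error.

    z_pred: Z_{tau_i}(x_t, a_t), shape (N,)
    taus:   tau_i, shape (N,)
    z_next: Z_{tau'_j}(x_{t+1}, pi(x_{t+1})), shape (N',)
    """
    target = reward + gamma * z_next                # shape (N',)
    delta = target[None, :] - z_pred[:, None]       # shape (N, N')
    # Average over j (the 1/N' factor), then sum over i.
    return quantile_huber(delta, taus[:, None], kappa).mean(axis=1).sum()

def loss_q_target(z_pred, taus, z_next, reward, gamma, kappa=1.0):
    """Expectation target: the N' next-state samples are averaged into Q first."""
    target = reward + gamma * z_next.mean()         # scalar
    delta = target - z_pred                         # shape (N,)
    return quantile_huber(delta, taus, kappa).sum()
```

The two functions agree when all next-state samples are equal and differ otherwise, because the first one averages the loss over targets while the second one averages the targets before applying the loss.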

score 1 · Answer 1 · answered Apr 04 '19 at 11:27

The replacement you suggest swaps a random variable for its expectation in the forward (next-state) part of the TD target. That would turn IQN into a variant of C51 with a randomly sampled function approximator in place of the discrete distribution. Both the distribution produced and, especially, the exploration behavior would be very different with your replacement. The authors of the paper explicitly say that, in their opinion, "more randomness" benefits training, so reducing randomness would go against the spirit of the paper. That the two produce similar results on a single toy test means very little. IQN could be better than C51 or worse than C51, but a single toy example is not enough to say they are close. Nevertheless, I agree that IQN looks overly complex and may require more training time; the C51 approach could be more practical.
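To illustrate how the produced distribution can differ, here is a toy sketch under strong simplifying assumptions: there is no network and no environment, the "target" is just a standard normal random variable, and the quantile estimates are fitted by plain stochastic subgradient steps on the (non-Huber) quantile loss. The helper `fit_quantiles` is hypothetical and exists only for this example.

```python
import numpy as np

def fit_quantiles(sample_target, taus, steps=20000, lr=0.01, seed=0):
    """Fit one estimate per tau by stochastic subgradient steps on the quantile loss."""
    rng = np.random.default_rng(seed)
    theta = np.zeros_like(taus)
    for _ in range(steps):
        y = sample_target(rng)
        # Move up by tau when the target lies above the estimate,
        # and down by (1 - tau) otherwise.
        theta += lr * np.where(y > theta, taus, taus - 1.0)
    return theta

taus = np.array([0.1, 0.5, 0.9])

# Targets are samples of the random variable (the Z-style target).
full = fit_quantiles(lambda rng: rng.normal(0.0, 1.0), taus)

# Targets are the expectation of the same variable (the Q-style target).
mean_only = fit_quantiles(lambda rng: 0.0, taus)

print("sampled targets:    ", full)
print("expectation targets:", mean_only)
```

With sampled targets the estimates spread out toward the quantiles of the target distribution, while with the expectation as target they all settle near the mean. In actual TD learning the reward and next state are still random, so the collapse would be toward the distribution of $r_t + \gamma Q(x_{t+1}, \cdot)$ rather than toward a single point, but the randomness carried by $Z$ at the next state is lost either way.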
