
I'm trying to understand the logic behind the magic of using the Gumbel distribution for action sampling inside the PPO2 algorithm.

This code snippet implements the action sampling, taken from here:

def sample(self):
    # u ~ Uniform(0, 1), same shape and dtype as the logits
    u = tf.random_uniform(tf.shape(self.logits), dtype=self.logits.dtype)
    # -log(-log(u)) is standard Gumbel noise; the argmax of (logits + noise) is the sampled action
    return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)

I've understood that this is a mathematical trick that makes it possible to backprop through the action sampling in the case of categorical variables.
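
To check my understanding, here is a minimal sketch of what the noise term seems to do (plain NumPy rather than the baselines TensorFlow code, with arbitrary example logits): adding -log(-log(u)) to the logits and taking the argmax draws actions with the same probabilities that a softmax over the logits would give.

    import numpy as np

    rng = np.random.default_rng(0)
    logits = np.array([1.0, 0.5, -0.5])              # arbitrary example logits
    probs = np.exp(logits) / np.exp(logits).sum()    # softmax probabilities

    n = 100_000
    u = rng.uniform(size=(n, logits.size))
    actions = np.argmax(logits - np.log(-np.log(u)), axis=-1)
    freq = np.bincount(actions, minlength=logits.size) / n

    print(probs)   # softmax of the logits
    print(freq)    # empirical action frequencies; should be close to probs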

  1. But why can't I just put a softmax layer on top of the logits and sample according to the given probabilities? Why do we need u?

  2. There is still the argmax, which is not differentiable. How can backprop work?

  3. Does u allow exploration? Imagine that at the beginning of training, Pi holds small, similar values (nothing has been learned so far). In this case the action sampling does not always choose the maximum value in Pi, because of logits - tf.log(-tf.log(u)). As training progresses, larger values arise in Pi, so the maximum value is taken more and more often during action sampling. But doesn't this mean that the whole action sampling process depends heavily on the value range of the current policy? (See the sketch after this list.)
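
To make the last point concrete, here is another small sketch (again plain NumPy, with made-up logits): with small, similar logits the sampled actions are spread almost uniformly, while scaling the same logits up makes the argmax action dominate.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_freq(logits, n=100_000):
        # frequency of each action under Gumbel-max sampling of the given logits
        u = rng.uniform(size=(n, len(logits)))
        actions = np.argmax(logits - np.log(-np.log(u)), axis=-1)
        return np.bincount(actions, minlength=len(logits)) / n

    early = np.array([0.1, 0.0, -0.1])    # early training: small, similar logits
    late = 10.0 * early                   # later: same ordering, larger value range
    print(sample_freq(early))  # nearly uniform -> lots of exploration
    print(sample_freq(late))   # heavily concentrated on the argmax action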
