In the paper "On the Variance of the Adaptive Learning Rate and Beyond" (Section 2), the authors write:
To further analyze this phenomenon, we visualize the histogram of the absolute value of gradients on a log scale in Figure 2. We observe that, without applying warmup, the gradient distribution is distorted to have a mass center in relatively small values within 10 updates. Such gradient distortion means that the vanilla Adam is trapped in bad/suspicious local optima after the first few updates.
Here is Figure 2 from the paper:
Can someone explain this part?
Such gradient distortion means that the vanilla Adam is trapped in bad/suspicious local optima after the first few updates.
Why is this true?
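For reference, this is how I understand what Figure 2 is measuring. Below is a minimal sketch (my own, using a toy PyTorch MLP on random data, not the authors' Transformer or code) of logging the histogram of |gradient| on a log scale during the first few updates of vanilla Adam without warmup, i.e. the quantity the quoted passage says gets "distorted":

```python
# Sketch only: record the histogram of |gradient| (log10 scale) over the
# first few Adam updates, with no learning-rate warmup. Toy model and data
# are placeholders, not the setup from the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # vanilla Adam, no warmup
loss_fn = nn.CrossEntropyLoss()

for step in range(10):  # "within 10 updates", as in the quote
    x = torch.randn(32, 64)
    y = torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()

    # Gather |gradient| over all parameters and histogram it on a log10 scale;
    # hist.hist / hist.bin_edges are what a Figure-2-style plot would show.
    grads = torch.cat([p.grad.abs().flatten() for p in model.parameters()])
    log_grads = torch.log10(grads + 1e-12)  # avoid log(0)
    hist = torch.histogram(log_grads, bins=30)
    print(f"step {step}: median |grad| = {grads.median().item():.2e}")

    optimizer.step()
```

My question is about interpreting such a plot: why does the mass of this histogram shifting toward small values imply that Adam has been "trapped in bad/suspicious local optima"?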