
Classical gradient descent sometimes overshoots and escapes a minimum because it depends only on the current gradient. You can see this problem in the update from point 6.

[Figure: loss curve with numbered gradient descent iterates; the update from point 6 overshoots the minimum toward point 7, and the one from 7 toward point 8.]

In classical GD algorithm, the update equation is

$$\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} \ell$$

In the momentum based GD algorithm, the update equations are

$$v_0 = 0$$
$$v_{t+1} = \alpha v_t + \eta \, \nabla_{\theta} \ell$$
$$\theta_{t+1} = \theta_t - v_{t+1}$$
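To see the difference concretely, here is a minimal runnable sketch of both update rules on a toy quadratic loss (the loss and the values $\eta = 0.1$, $\alpha = 0.9$ are my own illustrative choices, not from the lecture):

```python
# Toy loss l(theta) = theta^2, so grad l = 2*theta (illustrative choice).
grad = lambda theta: 2.0 * theta

ETA, ALPHA = 0.1, 0.9  # learning rate and momentum factor (arbitrary values)

def gd_step(theta):
    # Classical GD: the step depends on the current gradient only.
    return theta - ETA * grad(theta)

def momentum_step(theta, v):
    # Momentum GD: v is a decaying sum of past gradients,
    # so the step also reflects the gradient history.
    v = ALPHA * v + ETA * grad(theta)
    return theta - v, v

theta_gd, theta_m, v = 2.0, 2.0, 0.0
for _ in range(20):
    theta_gd = gd_step(theta_gd)
    theta_m, v = momentum_step(theta_m, v)
print(theta_gd, theta_m)  # both converge toward the minimum at theta = 0
                          # (the momentum iterate oscillates on the way)
```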

I am writing the equations concisely, omitting the obvious arguments such as the inputs to the loss function. In the lecture I am following, the narrator says that momentum-based GD helps during the update at point 6: the update does not lead to point 7 as shown in the figure, but instead moves toward the minimum.

But to me it seems that even momentum-based GD will go to point 7, and that it is the update at point 7 that benefits from momentum, since that update does not lead to point 8 but moves toward the minimum instead.

Am I correct? If not, at which point does momentum-based GD actually help?

hanugm

2 Answers


What your professor most probably means is this: momentum adds your previous gradients to the current one (as in a moving average). The speed built up until point 7 (moving to the right) will be added to the gradient from 7 to 8, which points to the left, so the two partially cancel each other out. That can make the update small enough for the loss to converge instead of exploding. So momentum helps at the point where the gradient suddenly changes direction, because the built-up momentum partially cancels the new gradient.
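A quick numeric illustration of that cancellation (the gradient values and hyperparameters here are invented for illustration):

```python
ETA, ALPHA = 0.1, 0.9  # arbitrary illustrative values

v = 0.0
for g in (1.0, 1.0, 1.0):     # same-signed gradients build up speed
    v = ALPHA * v + ETA * g
print(v)                       # 0.271: accumulated velocity

v = ALPHA * v + ETA * (-1.0)   # the gradient flips sign (as from 7 to 8)
print(v)                       # ~0.144: momentum and the new gradient
                               # partially cancel, shrinking the update
```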

hal9000

Actually, it won't go all the way to point 7, but somewhere less far away. Because the gradients at point 5 and before are less steep, the velocity accumulated by momentum is small, and the update will not shoot as far. The net effect is less overshoot, so momentum does help in this case.
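A small illustration of that point (the gradient magnitudes are invented): the velocity is a decaying sum of past gradients, so a shallow history adds little on top of the current gradient, while a steep history inflates the step:

```python
ETA, ALPHA = 0.1, 0.9  # arbitrary illustrative values

def velocity(grads):
    # velocity after processing a sequence of gradients with momentum
    v = 0.0
    for g in grads:
        v = ALPHA * v + ETA * g
    return v

# Shallow gradients before the steep one at point 6:
print(velocity([0.1, 0.1, 2.0]))   # ~0.217, barely above the plain GD step 0.2
# If the earlier gradients had been steep instead:
print(velocity([2.0, 2.0, 2.0]))   # ~0.542, more than twice the plain GD step
```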

user559678