3

I know this question may sound silly, but I cannot prove it.

In a Stanford slide (page 17), they define the formula of SGD with momentum like this:

$$ v_{t}=\rho v_{t-1}+\nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t}, $$

where:

  • $v_{t}$ is the momentum (velocity) value
  • $\rho$ is the friction coefficient, say $\rho = 0.9$
  • $\nabla f(x_{t-1})$ is the gradient of the objective function at iteration $t-1$
  • $x_t$ are the parameters
  • $\alpha$ is the learning rate

However, in this paper and many other documents, the update is defined like this:

$$ v_{t}=\rho v_{t-1}+\alpha \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}- v_{t}, $$

where $\rho$ and $\alpha$ have the same values as in the previous formula.

I think it should be

$$v_{t}=\alpha \rho v_{t-1}+\alpha \nabla f(x_{t-1})$$

if we want to move the learning rate inside the velocity update.

In some other documents (e.g. this one), the standard form of momentum is defined like this:

$$ v_{t}= \rho v_{t-1}+ (1- \rho) \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t} $$

I cannot see how to prove that these formulations are equivalent. Can someone help me?

CuCaRot
  • It really doesn't matter. It seems to me the equations are off by constants. Unless you are proving some performance bounds, it doesn't matter. As a matter of fact, SGD for neural nets doesn't even have a theoretical basis as far as I know. It is derived from theoretical methods used on very nice functions (NNs are very bad functions), so it hardly matters what you do in a NN. But in actual optimization theory you have specific formulas to calculate step size and descent direction. –  Dec 31 '20 at 07:02
  • @DuttaA I am not sure I understand you correctly, but stochastic gradient descent is one of the basic methods in convex optimization, right? And I don't understand the part "NNs are very bad functions", can you explain more about it? – CuCaRot Dec 31 '20 at 08:03
  • You are correct. It is a part of CO, but NNs are nowhere near a CO problem. There are performance guarantees for the optimization algorithm when you optimize a CO problem (along with some additional constraints like coercivity and boundedness): you can actually state with definiteness the number of steps required to reach a local minimum (in CO that implies a global minimum as well). For this there are particular theories involving matrix analysis, which you cannot do in a NN. The momentum method can also be given performance guarantees. –  Dec 31 '20 at 09:18
  • I see, maybe you feel annoyed by those unclear assumptions. However, as an amateur, I know that NNs are not a CO problem, but the performance of SGD (with or without momentum) is really good for optimizing the parameters, so I just want to understand the similarity between those equations (for later, or maybe for an interview). Anyway, happy new year! – CuCaRot Jan 01 '21 at 09:32
  • No, I don't feel annoyed. I just pointed it out; I have seen SGD (and been guilty of it myself) and convexity terms thrown around a lot for NNs when the relationship does not hold. –  Jan 02 '21 at 12:18

1 Answer

4

The first two equations are equivalent. The last equation can be equivalent if you scale $\alpha$ appropriately.

Equation 1

Consider the equation from the Stanford slide:

$$ v_{t}=\rho v_{t-1}+\nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t}, $$

Let's evaluate the first few $v_t$ so that we can arrive at a closed-form expression:

$$\begin{aligned}
v_0 &= 0 \\
v_1 &= \rho v_0 + \nabla f(x_0) = \nabla f(x_0)\\
v_2 &= \rho v_1 + \nabla f(x_1) = \rho \nabla f(x_0) + \nabla f(x_1)\\
v_3 &= \rho v_2 + \nabla f(x_2) = \rho^2 \nabla f(x_0) + \rho \nabla f(x_1) + \nabla f(x_2)\\
&\;\;\vdots \\
v_t &= \sum_{i=0}^{t-1} \rho^{t-1-i} \nabla f(x_i)
\end{aligned}$$

So the closed form update is:

$$x_t = x_{t-1} - \alpha \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} \nabla f(x_i)$$
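As a sanity check (my addition, not from the slide), we can unroll the recursion numerically on a toy objective $f(x) = x^2$ (with gradient $2x$; $\rho$, $\alpha$, and $x_0$ are arbitrary choices) and confirm that the recursive velocity matches the closed-form sum:

```python
# Toy objective f(x) = x^2, so grad f(x) = 2x.
def grad(x):
    return 2.0 * x

rho, alpha, steps = 0.9, 0.1, 10

x, v = 1.0, 0.0
grads = []  # record grad(x_i) along the trajectory
for t in range(steps):
    grads.append(grad(x))
    v = rho * v + grads[-1]      # recursive form: v_t = rho*v_{t-1} + grad(x_{t-1})
    x = x - alpha * v            # x_t = x_{t-1} - alpha*v_t

# Closed form: v_t = sum_{i=0}^{t-1} rho^(t-1-i) * grad(x_i)
v_closed = sum(rho ** (steps - 1 - i) * g for i, g in enumerate(grads))
print(abs(v - v_closed))  # ~0 up to floating-point rounding
```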

Equation 2

Now consider the equation from the paper: $$ v_{t}=\rho v_{t-1}+\alpha \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}- v_{t}, $$

We again evaluate the first few $v_t$ to arrive at a closed form solution:

$$\begin{aligned}
v_0 &= 0 \\
v_1 &= \rho v_0 + \alpha \nabla f(x_0) = \alpha \nabla f(x_0)\\
v_2 &= \rho v_1 + \alpha \nabla f(x_1) = \rho \alpha \nabla f(x_0) + \alpha \nabla f(x_1)\\
v_3 &= \rho v_2 + \alpha \nabla f(x_2) = \rho^2 \alpha \nabla f(x_0) + \rho \alpha \nabla f(x_1) + \alpha \nabla f(x_2)\\
&\;\;\vdots \\
v_t &= \sum_{i=0}^{t-1} \rho^{t-1-i} \alpha \nabla f(x_i)
\end{aligned}$$

So the closed form update is:

$$x_t = x_{t-1} - \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} \alpha \nabla f(x_i)$$

As you can see, this is equivalent to the previous closed form update. The only difference is whether $\alpha$ sits inside or outside the summation, and since it is a constant, that doesn't matter.
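We can also verify this equivalence numerically (my addition, on the same toy objective $f(x) = x^2$; $\rho$, $\alpha$, and the starting point are arbitrary): running both formulations side by side produces identical iterates.

```python
# Toy objective f(x) = x^2, so grad f(x) = 2x.
def grad(x):
    return 2.0 * x

rho, alpha, steps = 0.9, 0.1, 50

# Formulation 1 (Stanford slide): v_t = rho*v_{t-1} + grad,  x_t = x_{t-1} - alpha*v_t
x1, v1 = 1.0, 0.0
# Formulation 2 (paper):          v_t = rho*v_{t-1} + alpha*grad,  x_t = x_{t-1} - v_t
x2, v2 = 1.0, 0.0

for _ in range(steps):
    v1 = rho * v1 + grad(x1)
    x1 = x1 - alpha * v1
    v2 = rho * v2 + alpha * grad(x2)
    x2 = x2 - v2

print(abs(x1 - x2))  # ~0: the trajectories coincide
```

Note that the velocities differ by a factor of $\alpha$ ($v^{(2)}_t = \alpha\, v^{(1)}_t$), but the parameter updates, and hence the iterates $x_t$, are the same.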

Equation 3

As for the last equation

$$ v_{t}= \rho v_{t-1}+ (1- \rho) \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t} $$

Let's do the same thing:

$$\begin{aligned}
v_0 &= 0 \\
v_1 &= \rho v_0 + (1-\rho) \nabla f(x_0) = (1-\rho) \nabla f(x_0)\\
v_2 &= \rho v_1 + (1-\rho) \nabla f(x_1) = \rho (1-\rho) \nabla f(x_0) + (1-\rho) \nabla f(x_1)\\
v_3 &= \rho v_2 + (1-\rho) \nabla f(x_2) = \rho^2 (1-\rho) \nabla f(x_0) + \rho (1-\rho) \nabla f(x_1) + (1-\rho) \nabla f(x_2)\\
&\;\;\vdots \\
v_t &= \sum_{i=0}^{t-1} \rho^{t-1-i} (1-\rho) \nabla f(x_i)
\end{aligned}$$

And so the closed form update is:

$$x_t = x_{t-1} - \alpha \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} (1-\rho) \nabla f(x_i)$$

This equation is equivalent to the other two as long as you scale $\alpha$ by a factor of $\displaystyle \frac{1}{1-\rho}$.
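The rescaling claim can be checked numerically as well (my addition, same toy objective $f(x) = x^2$, arbitrary $\rho$ and $\alpha$): formulation 3 run with learning rate $\alpha/(1-\rho)$ reproduces the iterates of formulation 1 run with learning rate $\alpha$.

```python
# Toy objective f(x) = x^2, so grad f(x) = 2x.
def grad(x):
    return 2.0 * x

rho, alpha, steps = 0.9, 0.01, 50

# Formulation 1 with learning rate alpha
x1, v1 = 1.0, 0.0
# Formulation 3 with learning rate alpha / (1 - rho)
x3, v3 = 1.0, 0.0
alpha3 = alpha / (1.0 - rho)

for _ in range(steps):
    v1 = rho * v1 + grad(x1)
    x1 = x1 - alpha * v1
    v3 = rho * v3 + (1.0 - rho) * grad(x3)
    x3 = x3 - alpha3 * v3

print(abs(x1 - x3))  # ~0: the trajectories coincide
```

Here $v^{(3)}_t = (1-\rho)\, v^{(1)}_t$, so multiplying by $\alpha/(1-\rho)$ in the parameter update cancels the $(1-\rho)$ factor exactly.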

user3667125