3

I know this question may sound silly, but I cannot prove it.

In a Stanford slide (page 17), they define the formula of SGD with momentum like this:

$$ v_{t}=\rho v_{t-1}+\nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t}, $$

where:

  • $v_{t}$ is the momentum (velocity) value
  • $\rho$ is the friction coefficient, say $\rho = 0.9$
  • $\nabla f(x_{t-1})$ is the gradient of the objective function at iteration $t-1$
  • $x_t$ are the parameters
  • $\alpha$ is the learning rate

However, in this paper and many other documents, the update is defined like this:

$$ v_{t}=\rho v_{t-1}+\alpha \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}- v_{t}, $$

where $\rho$ and $\alpha$ have the same values as in the previous formula.

I think it should be

$$v_{t}=\alpha \rho v_{t-1}+\alpha \nabla f(x_{t-1})$$

if we want to move the learning rate inside the velocity update.

In some other documents (e.g. this one), the standard form of momentum is defined like this:

$$ v_{t}= \rho v_{t-1}+ (1- \rho) \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t} $$

I cannot see how to prove that these formulations are equivalent. Can someone help me?

CuCaRot
  • It really doesn't matter. It seems to me the equations are off by constants. Unless you are proving some performance bounds, it doesn't matter. As a matter of fact, SGD for neural nets doesn't even have a theoretical basis as far as I know. It is derived from theoretical methods used on very nice functions (NNs are very bad functions), so it hardly matters what you do in a NN. But in actual optimization theory you have specific formulas to calculate step size and descent direction. –  Dec 31 '20 at 07:02
  • @DuttaA I am not sure I understand you correctly, but stochastic gradient descent is one of the basic methods in convex optimization, right? And I don't understand the part "NNs are very bad functions", can you explain more about it? – CuCaRot Dec 31 '20 at 08:03
  • You are correct. It is a part of CO, but NNs are nowhere near a CO problem. There are performance guarantees for the optimization algorithm when you optimize a CO problem (along with some additional constraints like coercivity and boundedness): you can actually state with definiteness the number of steps required to reach a local minimum (in CO that implies a global minimum as well). For this there are particular theories involving matrix analysis, which you cannot do in a NN. The momentum method can also be given performance guarantees. –  Dec 31 '20 at 09:18
  • I see, maybe you feel annoyed by those unclear assumptions. However, as an amateur, I know that NNs are not a CO problem, but the performance of SGD (with or without momentum) is really good for optimizing the parameters, so I just want to understand the similarity between those equations (for later, or maybe for an interview). Anyway, happy new year! – CuCaRot Jan 01 '21 at 09:32
  • No, I don't feel annoyed. I just pointed it out; I have seen SGD (and been guilty of it myself) and convexity terms thrown around a lot for NNs when the relationship does not hold. –  Jan 02 '21 at 12:18

1 Answer

4

The first two equations are equivalent. The last equation can be equivalent if you scale $\alpha$ appropriately.

Equation 1

Consider the equation from the Stanford slide:

$$ v_{t}=\rho v_{t-1}+\nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t}, $$

Let's evaluate the first few $v_t$ so that we can arrive at a closed-form expression:

$$\begin{aligned}
v_0 &= 0 \\
v_1 &= \rho v_0 + \nabla f(x_0) = \nabla f(x_0)\\
v_2 &= \rho v_1 + \nabla f(x_1) = \rho \nabla f(x_0) + \nabla f(x_1)\\
v_3 &= \rho v_2 + \nabla f(x_2) = \rho^2 \nabla f(x_0) + \rho \nabla f(x_1) + \nabla f(x_2)\\
&\;\;\vdots \\
v_t &= \sum_{i=0}^{t-1} \rho^{t-1-i} \nabla f(x_i)
\end{aligned}$$

So the closed form update is:

$$x_t = x_{t-1} - \alpha \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} \nabla f(x_i)$$
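As a sanity check (my addition, not from the slide), we can unroll the recursion numerically on a toy objective $f(x) = x^2$ (with gradient $2x$; $\rho$, $\alpha$, and $x_0$ are arbitrary choices) and confirm that the recursive velocity matches the closed-form sum:

```python
# Toy objective f(x) = x^2, so grad f(x) = 2x.
def grad(x):
    return 2.0 * x

rho, alpha, steps = 0.9, 0.1, 10

x, v = 1.0, 0.0
grads = []  # record grad(x_i) along the trajectory
for t in range(steps):
    grads.append(grad(x))
    v = rho * v + grads[-1]      # recursive form: v_t = rho*v_{t-1} + grad(x_{t-1})
    x = x - alpha * v            # x_t = x_{t-1} - alpha*v_t

# Closed form: v_t = sum_{i=0}^{t-1} rho^(t-1-i) * grad(x_i)
v_closed = sum(rho ** (steps - 1 - i) * g for i, g in enumerate(grads))
print(abs(v - v_closed))  # ~0 up to floating-point rounding
```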

Equation 2

Now consider the equation from the paper: $$ v_{t}=\rho v_{t-1}+\alpha \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}- v_{t}, $$

We again evaluate the first few $v_t$ to arrive at a closed form solution:

$$\begin{aligned}
v_0 &= 0 \\
v_1 &= \rho v_0 + \alpha \nabla f(x_0) = \alpha \nabla f(x_0)\\
v_2 &= \rho v_1 + \alpha \nabla f(x_1) = \rho \alpha \nabla f(x_0) + \alpha \nabla f(x_1)\\
v_3 &= \rho v_2 + \alpha \nabla f(x_2) = \rho^2 \alpha \nabla f(x_0) + \rho \alpha \nabla f(x_1) + \alpha \nabla f(x_2)\\
&\;\;\vdots \\
v_t &= \sum_{i=0}^{t-1} \rho^{t-1-i} \alpha \nabla f(x_i)
\end{aligned}$$

So the closed form update is:

$$x_t = x_{t-1} - \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} \alpha \nabla f(x_i)$$

As you can see, this is equivalent to the previous closed form update. The only difference is whether $\alpha$ sits inside or outside the summation, and since it is a constant, that doesn't matter.
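We can also verify this equivalence numerically (my addition, on the same toy objective $f(x) = x^2$; $\rho$, $\alpha$, and the starting point are arbitrary): running both formulations side by side produces identical iterates.

```python
# Toy objective f(x) = x^2, so grad f(x) = 2x.
def grad(x):
    return 2.0 * x

rho, alpha, steps = 0.9, 0.1, 50

# Formulation 1 (Stanford slide): v_t = rho*v_{t-1} + grad,  x_t = x_{t-1} - alpha*v_t
x1, v1 = 1.0, 0.0
# Formulation 2 (paper):          v_t = rho*v_{t-1} + alpha*grad,  x_t = x_{t-1} - v_t
x2, v2 = 1.0, 0.0

for _ in range(steps):
    v1 = rho * v1 + grad(x1)
    x1 = x1 - alpha * v1
    v2 = rho * v2 + alpha * grad(x2)
    x2 = x2 - v2

print(abs(x1 - x2))  # ~0: the trajectories coincide
```

Note that the velocities differ by a factor of $\alpha$ ($v^{(2)}_t = \alpha\, v^{(1)}_t$), but the parameter updates, and hence the iterates $x_t$, are the same.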

Equation 3

As for the last equation

$$ v_{t}= \rho v_{t-1}+ (1- \rho) \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t} $$

Let's do the same thing:

$$\begin{aligned}
v_0 &= 0 \\
v_1 &= \rho v_0 + (1-\rho) \nabla f(x_0) = (1-\rho) \nabla f(x_0)\\
v_2 &= \rho v_1 + (1-\rho) \nabla f(x_1) = \rho (1-\rho) \nabla f(x_0) + (1-\rho) \nabla f(x_1)\\
v_3 &= \rho v_2 + (1-\rho) \nabla f(x_2) = \rho^2 (1-\rho) \nabla f(x_0) + \rho (1-\rho) \nabla f(x_1) + (1-\rho) \nabla f(x_2)\\
&\;\;\vdots \\
v_t &= \sum_{i=0}^{t-1} \rho^{t-1-i} (1-\rho) \nabla f(x_i)
\end{aligned}$$

And so the closed form update is:

$$x_t = x_{t-1} - \alpha \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} (1-\rho) \nabla f(x_i)$$

This equation is equivalent to the other two as long as you scale $\alpha$ by a factor of $\displaystyle \frac{1}{1-\rho}$.
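The rescaling claim can be checked numerically as well (my addition, same toy objective $f(x) = x^2$, arbitrary $\rho$ and $\alpha$): formulation 3 run with learning rate $\alpha/(1-\rho)$ reproduces the iterates of formulation 1 run with learning rate $\alpha$.

```python
# Toy objective f(x) = x^2, so grad f(x) = 2x.
def grad(x):
    return 2.0 * x

rho, alpha, steps = 0.9, 0.01, 50

# Formulation 1 with learning rate alpha
x1, v1 = 1.0, 0.0
# Formulation 3 with learning rate alpha / (1 - rho)
x3, v3 = 1.0, 0.0
alpha3 = alpha / (1.0 - rho)

for _ in range(steps):
    v1 = rho * v1 + grad(x1)
    x1 = x1 - alpha * v1
    v3 = rho * v3 + (1.0 - rho) * grad(x3)
    x3 = x3 - alpha3 * v3

print(abs(x1 - x3))  # ~0: the trajectories coincide
```

Here $v^{(3)}_t = (1-\rho)\, v^{(1)}_t$, so multiplying by $\alpha/(1-\rho)$ in the parameter update cancels the $(1-\rho)$ factor exactly.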

user3667125