
How can I prove that gradient descent doesn't necessarily find the global optimum?

For example, consider the following function

$$f(x_1, x_2, x_3, x_4) = (x_1 + 10x_2)^2 + 5x_2^3 + (x_2 + 2x_3)^4 + 3x_1x_4^2$$

Assume also that we can't find the optimal value for the learning rate because of time constraints.
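For concreteness, here is the plain gradient-descent loop I have in mind, with the gradient of $f$ written out by hand. This is only a minimal sketch: the starting point and the learning rate are arbitrary choices, not tuned values.

```python
import numpy as np

def f(x):
    x1, x2, x3, x4 = x
    return (x1 + 10*x2)**2 + 5*x2**3 + (x2 + 2*x3)**4 + 3*x1*x4**2

def grad_f(x):
    x1, x2, x3, x4 = x
    return np.array([
        2*(x1 + 10*x2) + 3*x4**2,                       # df/dx1
        20*(x1 + 10*x2) + 15*x2**2 + 4*(x2 + 2*x3)**3,  # df/dx2
        8*(x2 + 2*x3)**3,                               # df/dx3
        6*x1*x4,                                        # df/dx4
    ])

x = np.array([3.0, -1.0, 0.0, 1.0])  # arbitrary starting point
lr = 1e-3                            # fixed, hand-picked learning rate
for _ in range(10_000):
    x -= lr * grad_f(x)
    if not np.isfinite(f(x)):        # note: the cubic term makes f unbounded
        break                        # below, so the iterates can diverge
print(x, f(x))
```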


3 Answers


You can construct a counterexample yourself showing that, in general, GD is not guaranteed to find the global optimum!

I first advise you to choose a simpler function than the one you are showing, with two or three optima, where one is the global optimum and the other(s) are local. You don't need neural networks or any other ML concept to show this, only basic calculus (derivatives) and a numerical method (gradient descent). Choose a very simple function with more than one optimum and apply the basic gradient descent algorithm. You will then see that, if you start gradient descent close to a local optimum (i.e. you choose an initial value for $x$ or $\theta$, depending on your notation for the variable of the function) and run it for some iterations, you end up in that nearby local optimum and cannot escape it.
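For instance, here is a minimal sketch of this experiment, using a 1-D function of my own choosing with one global and one local minimum (the function, step size, and starting points are arbitrary):

```python
def f(x):
    return x**4 - 3*x**2 + x   # global minimum near x = -1.30, local one near x = 1.13

def df(x):
    return 4*x**3 - 6*x + 1    # derivative of f

def gd(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * df(x)
    return x

print(gd(-2.0))  # starts near the global minimum and finds it:   x ~ -1.30
print(gd(2.0))   # starts near the local minimum and gets stuck:  x ~  1.13
```

Both runs use the same function and the same learning rate; only the initialization differs, and only one of them reaches the global minimum.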

See also the questions Does gradient descent always converge to an optimum? and For convex problems, does gradient in Stochastic Gradient Descent (SGD) always point at the global extreme value?

nbro

Well, GD terminates once the gradient is 0, right? Now, in a non-convex function, there can be points that are not the global minimum and yet have zero gradient, for example saddle points and local minima.

Consider the picture below, and say you start GD at the point marked x. [figure: a loss curve with a flat region where the gradient is zero, and the global minimum to the left of that region]

GD will bring you to the flat area and will stop making progress there, as the gradient is 0. However, as you can see, the global minimum is to the left of this flat region.

By the same token, for your own function you have to show that there exists at least one point where the gradient is 0 and yet it is not the global minimum.
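For instance, here is a tiny sketch with the standard textbook saddle $f(x, y) = x^2 - y^2$ (my own example, not your $f$): starting on the $x$-axis, GD converges to $(0, 0)$, where the gradient is zero but which is a saddle, not a minimum.

```python
import numpy as np

def grad(p):                  # gradient of f(x, y) = x**2 - y**2
    x, y = p
    return np.array([2*x, -2*y])

p = np.array([1.0, 0.0])      # start on the x-axis, the saddle's attracting direction
for _ in range(1000):
    p -= 0.1 * grad(p)
print(p)                      # ~ [0, 0]: zero gradient, yet a saddle, not a minimum
```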

In addition to that, the convergence guarantee for convex functions depends on annealing the learning rate appropriately. For example, if your LR is too high, GD can just keep overshooting the minimum. The visualization from this page might help you understand the behavior of GD better.
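To see the overshooting concretely, here is a minimal sketch on $f(x) = x^2$ (my own toy example): with a small LR the iterates converge, with LR exactly 1 they bounce between $+1$ and $-1$ forever, and with a larger LR they diverge.

```python
def df(x):                   # gradient of f(x) = x**2
    return 2*x

for lr in (0.1, 1.0, 1.1):
    x = 1.0
    for _ in range(20):
        x -= lr * df(x)      # update rule: x <- (1 - 2 * lr) * x
    print(lr, x)
# lr = 0.1 -> ~0.012 (converging)
# lr = 1.0 -> 1.0    (oscillates between +1 and -1)
# lr = 1.1 -> ~38    (diverging)
```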

SpiderRico
  • For a convex function where the learning rate is unknown (and, as mentioned, not the optimal value), how can we tell whether GD converges or not? – Mostafa Ghadimi Mar 17 '20 at 11:44
  • If the function is convex, GD converges with a decaying learning rate. I guess if you don't make any assumptions about the learning rate, it could keep overshooting the global minimum. Say you're 1 unit left of the minimum, but due to a big LR you end up 3 units right of it, then back 1 unit left of it, and so on. – SpiderRico Mar 17 '20 at 11:47
  • How can I find out whether the LR is decaying or not? Is there any way to detect the oscillation? – Mostafa Ghadimi Mar 17 '20 at 23:05
  • Well, in your question, are you allowed to set the LR? If not, you have to look for auxiliary information. For example, if your training loss / function value jumps back and forth, this probably indicates that GD is overshooting its target due to a big LR. – SpiderRico Mar 17 '20 at 23:11
  • I'm allowed to, but I can't find the optimal value for the LR due to the time limit. – Mostafa Ghadimi Mar 17 '20 at 23:26
  • Okay, then you have to monitor the function value and decrease the learning rate when it starts to overshoot (see the sketch after this thread). Also look into LR scheduling methods, such as line search, in general. – SpiderRico Mar 17 '20 at 23:29
  • Please add that to your answer to confirm it. I'll upvote it if you want. Thanks. – Mostafa Ghadimi Mar 17 '20 at 23:37
  • Sure. I've added a link that visualizes how GD behaves on loss surfaces. You might find it useful. – SpiderRico Mar 18 '20 at 16:21
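Following up on the comment thread above, here is a minimal sketch of the "decrease the LR when it overshoots" heuristic (the function and names are mine; a real implementation would use a proper line search, e.g. backtracking):

```python
def gd_with_backoff(f, df, x0, lr=1.5, steps=100, shrink=0.5):
    x, fx = x0, f(x0)
    for _ in range(steps):
        x_new = x - lr * df(x)
        f_new = f(x_new)
        if f_new > fx:       # the function value went up: we overshot,
            lr *= shrink     # so reject the step and decay the learning rate
            continue
        x, fx = x_new, f_new
    return x

# On f(x) = x**2, the initial LR of 1.5 overshoots, gets halved, and GD then converges.
print(gd_with_backoff(lambda x: x**2, lambda x: 2*x, x0=1.0))
```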

There is no way you can be sure you have reached a global minimum. Steepest descent converges toward a point where the gradient approaches zero. Depending on the initial conditions (i.e. the initial values of the weights), you can and will converge to some minimum: if you run your model several times with random weight initialization, you will get slightly different results. What I do find interesting is that, in general, the local minima seem to have roughly the same value. The cost function is some kind of surface in $N$-space, where $N$ is the number of trainable parameters; we do not know what that surface looks like or how many local minima it has.
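You can reproduce this effect with a tiny sketch (my own toy function with several local minima, standing in for a loss surface): repeated runs from random initializations end up in different minima.

```python
import math, random

def df(x):                       # derivative of f(x) = sin(3x) + 0.1 * x**2
    return 3*math.cos(3*x) + 0.2*x

random.seed(0)
finals = set()
for _ in range(20):
    x = random.uniform(-5, 5)    # random initialization, as with network weights
    for _ in range(5000):
        x -= 0.01 * df(x)
    finals.add(round(x, 2))
print(sorted(finals))            # several distinct local minima, not a single value
```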

Gerry P
  • You are not even guaranteed to converge to a local minimum. You could converge to a saddle point with plain GD. – SpiderRico Mar 17 '20 at 23:13