
I've come across the concept of a fitness landscape before and, in my understanding, a smooth fitness landscape is one where the algorithm can converge on the global optimum through incremental movements or iterations across the landscape.

My question is: does deep learning assume that the fitness landscape on which gradient descent occurs is a smooth one? If so, is it a valid assumption?

Most of the graphical representations I have seen of gradient descent show a smooth landscape.

This Wikipedia page describes the fitness landscape: https://en.m.wikipedia.org/wiki/Fitness_landscape

Joebevo
  • What is the fitness landscape? Is it the surface where gradient descent occurs? To compute the derivatives we need a smooth manifold. So, I think the landscape is smooth. – Swakshar Deb Apr 09 '21 at 07:54
  • I am not sure what fitness landscape means, but if it is the landscape where gradient descent occurs, then I think smooth means there is no discontinuity or sudden jump in the surface, so we can compute the derivatives. – Swakshar Deb Apr 09 '21 at 07:57
  • I don't believe the landscape is theoretically smooth, since gradient descent in a neural net takes partial derivatives, so other weights may end up at wrong values while a certain weight is being optimised; this sometimes results in oscillation – Dee Apr 09 '21 at 09:42
  • You say "I've come across the concept of fitness landscape before". Can you please provide the link to the research papers or books that use this term "fitness landscape". I think this term occurs more in the context of evolutionary algorithms and not deep learning. – nbro Apr 09 '21 at 12:57
  • @nbro: You are right. It does occur mainly in the evolutionary context. I just thought the concept might still be applicable. – Joebevo Apr 09 '21 at 13:08
  • Please, provide the link to the papers where you found this term. People are giving answers but maybe based on wrong assumptions, and that makes those answers basically useless, because you didn't provide enough context. – nbro Apr 09 '21 at 13:09
  • https://en.m.wikipedia.org/wiki/Fitness_landscape – Joebevo Apr 09 '21 at 13:16
  • Just edit your post to include the relevant links and context, and maybe explain that you thought that "fitness landscape" would also be applicable in the context of DL. – nbro Apr 09 '21 at 13:16
  • The Wikipedia page gives a good overview – Joebevo Apr 09 '21 at 13:16
  • So, what do you mean by "fitness landscape" in the context of deep learning? Maybe it would be better to change "fitness landscape" to a more appropriate term. Maybe "loss landscape"? – nbro Apr 09 '21 at 15:39

3 Answers


I'm going to take the fitness landscape to be the graph of the loss function, $\mathcal{G} = \{\left(\theta, L(\theta)\right) : \theta \in \mathbb{R}^n\}$, where $\theta$ parameterises the network (i.e. it is the weights and biases) and $L$ is a given loss function; in other words, the surface you would get by plotting the loss function against its parameters.

We always assume the loss function is differentiable in order to do backpropagation, which means at the very least the loss function is smooth enough to be continuous, but in principle it may not be infinitely differentiable.¹

You talk about using gradient descent to find the global minimiser. In general this is not possible: many functions have local minimisers which are not global minimisers. As an example, you could plot $y = x^2 \sin(1/x^2)$; the situation is of course similar, if harder to visualise, in higher dimensions. A certain class of functions, known as convex functions, satisfies the property that any local minimiser is a global minimiser. Unfortunately, the loss function of a neural network is rarely convex.
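
For a concrete, if toy, illustration, here is a minimal sketch in plain Python of running gradient descent on that one-dimensional example from two different starting points (the step size, iteration count, and starting points are arbitrary choices made purely for illustration):

```python
import math

def f(x):
    """The example function y = x^2 * sin(1/x^2)."""
    return x ** 2 * math.sin(1.0 / x ** 2)

def grad_f(x):
    """Analytic derivative: 2x*sin(1/x^2) - (2/x)*cos(1/x^2)."""
    return 2 * x * math.sin(1.0 / x ** 2) - (2.0 / x) * math.cos(1.0 / x ** 2)

def gradient_descent(x0, lr=1e-3, steps=20_000):
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

for x0 in (0.8, 0.25):
    x_star = gradient_descent(x0)
    print(f"start at {x0:.2f} -> x = {x_star:.4f}, f(x) = {f(x_star):.4f}")

# Typically, the run started at 0.8 settles near the global minimiser
# (around x ~ 0.47, f ~ -0.22), while the run started at 0.25 gets stuck
# in a much shallower local minimum (around x ~ 0.24, f ~ -0.06):
# which minimum you reach depends entirely on where you start.
```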

For some interesting pictures, see Visualizing the Loss Landscape of Neural Nets by Li et al.
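
As a very stripped-down sketch of the idea behind such visualisations (the paper itself uses filter-normalised 2D slices; this is just a bare-bones 1D version with an arbitrary untrained model and random data, so every name below is an illustrative placeholder), one can evaluate the loss along a random direction in parameter space:

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

torch.manual_seed(0)

# Tiny model and random data, purely for illustration.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(256, 10), torch.randn(256, 1)
loss_fn = nn.MSELoss()

theta = parameters_to_vector(model.parameters()).detach()
direction = torch.randn_like(theta)
direction *= theta.norm() / direction.norm()   # crude rescaling, not filter normalisation

alphas = torch.linspace(-1.0, 1.0, steps=51)
losses = []
with torch.no_grad():
    for a in alphas:
        vector_to_parameters(theta + a * direction, model.parameters())
        losses.append(loss_fn(model(x), y).item())
vector_to_parameters(theta, model.parameters())  # restore the original weights

for a, l in zip(alphas.tolist()[::10], losses[::10]):
    print(f"alpha = {a:+.1f}   loss = {l:.3f}")
```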


¹ For a more detailed discussion on continuity and differentiability, any good text on mathematical analysis will do, for example Rudin's Principles of Mathematical Analysis. In general, any function $f$ that is differentiable on some interval is also continuous, but it need not be twice differentiable, i.e. $f''$ need not exist.

htl
  • This answer seems reasonable to me, i.e. you need differentiability to apply back-propagation, but note that not all neural networks are trained with back-prop (although this may be an assumption, when you use the term "deep learning"). But why do you say that $G$ is a graph? Maybe you mean a space (https://en.wikipedia.org/wiki/Space_(mathematics))? In any case, it seems that you're using the term "graph" not as defined in graph theory. – nbro Apr 09 '21 at 17:19
  • @nbro Thanks for the question, I mean graph as in the literal sense of plotting a graph of a function (you can see an example of this usage [here](https://en.wikipedia.org/wiki/Graph_of_a_function#Definition), for example). My choice of notation has clearly added to the confusion as it is common to denote a graph (in the network sense) by $G$, but I don't mean to say $G$ is a network. Hopefully the latest revision clarifies this. – htl Apr 09 '21 at 18:15

Main answer

To answer your question as directly as possible: No, deep learning does not make that "assumption".

But you're close. Just swap the word "assumption" with "imposition".

Deep learning sets things up such that the landscape is (mostly) smooth and always continuous*, and therefore it is possible to do some sort of optimization via gradient descent.

* quick footnotes on that bit:

  • Smoothness is a stronger condition than continuity; that's why I mention them both.
  • My statement is not authoritative, so take it with a grain of salt, especially the "always" bit. Maybe someone will debunk this in the comments.
  • The reason that I say "(mostly) smooth" is because I can think of a counterexample to smoothness, which is the ReLU activation function. ReLU is still continuous, though (a quick numerical check follows this list).
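
As a quick numerical check of that last bullet, in plain Python with finite differences: ReLU's one-sided difference quotients at zero disagree (so there is no derivative there), while the function values themselves change continuously.

```python
def relu(x):
    return max(0.0, x)

h = 1e-6

# Continuity at 0: values on either side approach relu(0) = 0.
print(relu(-h), relu(0.0), relu(h))   # 0.0 0.0 1e-06

# One-sided difference quotients at 0 disagree, so no derivative exists there.
left = (relu(0.0) - relu(-h)) / h     # 0.0
right = (relu(h) - relu(0.0)) / h     # 1.0
print(left, right)
```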

Further elaboration

In deep learning we have linear layers, which we know are differentiable. We also have non-linear activations, and a loss function, which for the purposes of this discussion can be bundled with the non-linear activations. If you look at papers which focus specifically on crafting new types of non-linear activations and loss functions, you will usually find a discussion section that goes something like "and we designed it this way such that it's differentiable. Here's how you differentiate it. Here are the properties of the derivative". For instance, just check out this paper on ELU, a refinement of ReLU.

We don't need to "assume" anything really, as we are the ones who designate the building blocks of the deep learning network. And the building blocks are not all that complicated in themselves, so we can know that they are differentiable (or piecewise differentiable like ReLU). And for rigor, I should also remind you that the composition of multiple differentiable functions is also differentiable.
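
To make that concrete, here is a minimal PyTorch sketch (the layer sizes, data, and the choice of ELU are arbitrary illustrations): a composition of linear layers, non-linear activations, and a loss is differentiable end to end, and autograd produces a gradient for every parameter.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small stack of differentiable / piecewise-differentiable building blocks.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ELU(),          # the smooth ReLU refinement mentioned above
    nn.Linear(16, 1),
)
loss_fn = nn.MSELoss()

x = torch.randn(32, 8)
y = torch.randn(32, 1)

loss = loss_fn(model(x), y)   # a composition of differentiable pieces
loss.backward()               # backpropagation through the whole composition

for name, p in model.named_parameters():
    print(name, tuple(p.grad.shape))
```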

So hopefully that helps you see what I mean when I say deep learning architects "impose" differentiability, rather than "assume" it. After all, we are the architects!

Alexander Soare

Does deep learning assume that the fitness landscape on which the gradient descent occurs is a smooth one?

One can interpret this question from a formal-mathematical standpoint and from a more "intuitively-practical" standpoint.

From the formal point of view, smoothness is the requirement that the function is continuous with continuous first derivatives. And this assumption is quite often not true in lots of applications, mostly because of the widespread use of the ReLU activation function, which is not differentiable at zero.

From the practical point of view, though, by "smoothness" we mean that the function's "landscape" does not have a lot of sharp jumps and edges like the ones in the picture below:

[Image: visualization of a spiky, non-smooth neural network loss landscape]

Practically, there's not much difference between having a discontinuous derivative and having derivatives that make very sharp jumps.

And again, the answer is no: the loss landscape is extremely spiky, with lots of sharp edges; the picture above is an example of an actual loss function landscape.

But... why does gradient descent work, then?

As far as I know, this is the subject of an ongoing discussion in the community. There are different takes and some conflicting viewpoints that are still the subject of debate.

My opinion is that, fundamentally, the idea that we need it to converge to the global optimum is a flawed one. Neural networks have been shown to have enough capacity to completely memorize the training dataset. A neural network that has completely memorized the training data has reached the global minimum (given only the training data). We are not interested in such overtrained models; we want models that generalize well.

As far as I know, there are no conclusive results on which properties of a minimum are linked to the ability to generalize. People argued that these should be the "flat" minima, but that was later refuted. After that, the term "wide optimum" was introduced and gave rise to an interesting technique called Stochastic Weight Averaging.
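
For reference, here is a rough sketch of Stochastic Weight Averaging along the lines of the `torch.optim.swa_utils` recipe; the toy model, data, and every hyperparameter below are placeholders, and exact API details may vary between PyTorch versions:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

torch.manual_seed(0)

# Toy model and data, purely to make the sketch runnable.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loader = DataLoader(
    TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,))),
    batch_size=32,
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
swa_model = AveragedModel(model)              # running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)

swa_start = 15
for epoch in range(20):
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)    # fold current weights into the average
        swa_scheduler.step()

update_bn(loader, swa_model)  # refresh any BatchNorm statistics for the averaged model
```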

Kostya
  • Why would we be looking for the "widest" minimum? Moreover, although I think I know what you mean, you should explain what you mean by "widest minimum" (maybe a more common term is "flat minimum"). I really don't understand why we would not be looking for the **global** minimum. The global minimum/maximum is the optimal solution. Not sure what you're talking about here and where your conclusions come from. – nbro Apr 09 '21 at 15:41
  • @nbro Would you agree that a neural network, that completely remembered the training data, has reached the global optimization minimum? – Kostya Apr 09 '21 at 16:53
  • Mathematics is not about agreeing on something or not. It's about making assumptions or starting from premises and following the rules to reach some conclusions. In any case, that would be the global optimum but for the training dataset. In reality, as you know, we're not trying to overfit the training data but to find a function that generalizes well to unseen data too. The thing is: I don't know where your conclusion that we're trying to find the "widest minimum" comes from and why that would mean "better generalization", which is what you seem to mean/imply now. – nbro Apr 09 '21 at 17:01
  • To be honest, the question is not even clear. The author is talking about the "fitness landscape", but they didn't even clarify what that would mean in the context of deep learning. They are probably talking about the "loss landscape", but it's also not clear. I'm not an expert on the topic, but I don't think that your conclusion (last part) is well motivated. The first paragraphs seem reasonable to me, although, again, I am not a mathematician and not an expert on this specific topic of "loss landscapes", whatever the definition is. – nbro Apr 09 '21 at 17:03
  • @nbro I thought that my three last sentences unambiguously stated that I'm talking about generalization... If you have a suggestion on how to rephrase that, I'd happily do it. I didn't want to go into the formal details because it was my slightly opinionated aside on the subject. Also, I do agree that the question is not well-phrased, but demanding a mathematically precise formulation from beginners is not very productive, wouldn't you say? – Kostya Apr 09 '21 at 17:09
  • In your last paragraph, you indeed say "we want generalization". That's correct. But where does the "widest minimum" come from and what does it even mean? It's not clear. That's what I'm questioning. – nbro Apr 09 '21 at 17:13
  • Furthermore, I think that giving an answer with many mathematical details to such a poorly formulated question may not be easy to follow for the beginner, but it definitely would clarify certain doubts, if you understand the math. What I think, generally, is that "no information is better than misleading information", so opinions are typically not very useful and this site is not about opinions, but more about facts. If the question is "Do you think that X is Y?", then, ok, you could say your opinion, but these questions are not suitable for our site. – nbro Apr 09 '21 at 17:14
  • @nbro I just used "wide minimum" as an intuitive analog of "minimum that generalizes well". I can remove "wide" if you insist. But I don't feel like using "flat" in that context, because there's some research that has shown that a minimum's curvature is not necessarily linked to its generalization performance (as well as its "globalness"). – Kostya Apr 09 '21 at 17:23
  • I think you should just define what you mean by that. What kind of minimum exactly are we looking for and, more importantly, why do you say that we're not looking for the global minimum? Also clarify that you mean the global minimum given only the training data. – nbro Apr 09 '21 at 17:25
  • @nbro Done. Actually, I did a refresher on the literature and "wide" is a pretty widely (heh) used term in publications on the subject... – Kostya Apr 09 '21 at 18:00
  • @Kostya, if the loss function is not smooth, how can we compute derivatives? – Swakshar Deb Apr 09 '21 at 18:36
  • @Kostya, yes, ReLU is not differentiable at $x=0$, but we assume the derivative is zero at $x=0$. – Swakshar Deb Apr 09 '21 at 18:38
  • @Kostya, I don't understand the link you make between gradient descent on a function that is not smooth and finding a local minimum or a better minimum (maybe you are calling it a wider minimum). Since we are standing on a function that has many local minima, it is natural to get stuck in a local minimum using gradient descent. I don't see how a non-smooth function and getting stuck in local minima are linked together. – Swakshar Deb Apr 09 '21 at 19:04