
There is one problem that has bugged me for quite a long time: the non-convex loss surface (multiple minima, e.g. as shown here) of neural networks that use a quadratic loss function.

Question: Why is the loss surface of a "common" AI problem usually non-convex, with multiple minima, even though we use e.g. a quadratic loss function (which in lectures is usually drawn as a simple, convex, quadratic function such as $x^2$)?

My guess:

Is it because we feed the loss function our highly non-linear model output, so that the resulting total loss surface is highly non-linear / non-convex? Specifically, the quadratic loss only approximates the infinitesimally small neighbourhood around a specific point (minibatch) as quadratic. Is that guess correct? This would imply that highly non-linear / very deep and complex models have a highly non-linear loss surface, while shallower models have fewer minima and a one-layer network has a convex one?
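The last part of the guess (a one-layer linear network with quadratic loss is convex) can be checked numerically. A minimal sketch in plain Python, with toy data chosen purely for illustration, verifying the midpoint-convexity inequality $L(\frac{a+b}{2}) \le \frac{L(a)+L(b)}{2}$ for the quadratic loss of a single linear weight:

```python
# Quadratic loss of a one-layer linear model f(x) = w*x:
# L(w) = sum_i (y_i - w*x_i)^2 is a sum of parabolas in w, hence convex.
# Toy data (illustrative values, not from any real problem):
xs = [0.5, -1.0, 2.0]
ys = [1.0, 0.3, -0.7]

def loss_linear(w):
    """Quadratic loss of the linear model w*x on the toy data."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

# Midpoint convexity holds for every pair of weights we try:
pairs = [(-3.0, 2.0), (0.1, 4.0), (-1.5, -0.2)]
for a, b in pairs:
    mid = loss_linear((a + b) / 2)
    avg = (loss_linear(a) + loss_linear(b)) / 2
    assert mid <= avg + 1e-12  # convexity: midpoint lies below the chord
```

The point is that composing a convex function ($u \mapsto u^2$) with an affine function of the weights preserves convexity, which is exactly what breaks down once non-linear activations are stacked in between.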

horsti

1 Answer


If I understood your question correctly: the quadratic loss function is not always convex. Its convexity depends on the input it takes. For example, consider a very basic NN that takes some input $x$ and returns \begin{equation} f(x)=\mathrm{relu}(w_1\,\mathrm{relu}(w_0x+b_0)+b_1). \end{equation} Using the quadratic ($\ell_2$) loss means minimizing the objective $\sum_i(x_i-f(x_i))^2$, but even when examining only one of the terms (say $i=0$), namely $(x_0-f(x_0))^2$, we see that it is not convex at all, as the following plot shows:

[plot: $(x_0-f(x_0))^2$ as a function of the weights, showing a clearly non-convex shape]
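The non-convexity in the plot can also be verified numerically. Here is a small sketch for the answer's two-layer ReLU model, with all parameter values chosen by me for illustration (everything fixed except $w_0$), exhibiting a concrete midpoint-convexity violation:

```python
def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)

# Toy two-layer ReLU net from the answer; illustrative fixed values,
# varying only w0:
x0, w1, b0, b1 = 1.0, 1.0, 0.0, -1.0

def loss_relu(w0):
    """One term of the quadratic loss, (x0 - f(x0))^2, as a function of w0."""
    f = relu(w1 * relu(w0 * x0 + b0) + b1)
    return (x0 - f) ** 2

# Midpoint convexity is violated at a = 0, b = 2:
# L(0) = 1, L(2) = 0, but L(1) = 1 > (1 + 0)/2.
a, b = 0.0, 2.0
mid = loss_relu((a + b) / 2)
avg = (loss_relu(a) + loss_relu(b)) / 2
assert mid > avg  # midpoint lies ABOVE the chord -> loss is NOT convex in w0
```

The flat region created by the ReLU (where the net outputs 0 regardless of $w_0$) sits at a constant loss of 1, while the loss drops to 0 at $w_0 = 2$, which is exactly the kind of plateau-plus-valley shape that a convex function cannot have.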

Hadar Sharvit
  • Yes, you did. So the convexity of the loss depends on both inputs (the prediction function and the ground-truth function)? As soon as one is non-convex, the loss will also be non-convex, right? What confuses me: I guess most real-world problems are non-convex and would thus always lead to a non-convex loss, yet I thought simple linear classifiers ($1/(1+e^{-(wx+b)})$) are convex? Or does the convexity depend only on (in your example) $f(x_i)$ and not on $x_i$ as well? – horsti Sep 16 '22 at 15:01
  • Generally speaking, you are correct: the composition of a convex function with a non-convex one may very well be non-convex. Notice that the sigmoid function is not convex (think of its "S" shape). – Hadar Sharvit Sep 16 '22 at 15:11
  • So the convexity of the loss also depends on the underlying ground-truth function that I try to approximate, right? So it is very likely that a very complex task, which I try to solve with a linear classifier, would yield a non-convex loss function (even though I am using the binary cross-entropy loss)? Or does the loss convexity depend only on the prediction function? – horsti Sep 16 '22 at 15:31
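To illustrate the point made in the comments about the sigmoid: a quick numerical check (values chosen purely for illustration) showing that the sigmoid itself violates midpoint convexity on the positive half-line, where its "S" shape is concave:

```python
import math

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

# Convexity would require sigmoid((a+b)/2) <= (sigmoid(a)+sigmoid(b))/2
# for all a, b. For z > 0 the sigmoid is concave, so the inequality flips:
a, b = 0.0, 4.0
mid = sigmoid((a + b) / 2)            # sigmoid(2)
avg = (sigmoid(a) + sigmoid(b)) / 2   # chord value at the midpoint
assert mid > avg  # midpoint above the chord -> sigmoid is not convex
```

Note that this only shows the sigmoid is not convex as a function; whether the full training objective is convex depends on how the sigmoid is composed with the loss (e.g. logistic regression with cross-entropy is convex in the weights, but plugging a sigmoid into a quadratic loss generally is not).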