14

Multiple resources I referred to mention that MSE is great because it's convex. But I don't get how, especially in the context of neural networks.

Let's say we have the following:

  • $X$: training dataset
  • $Y$: targets
  • $\Theta$: the set of parameters of the model $f_\Theta$ (a neural network model with non-linearities)

Then:

$$\operatorname{MSE}(\Theta) = (f_\Theta(X) - Y)^2$$

Why would this loss function always be convex? Does this depend on $f_\Theta(X)$?

user74211
  • Although this is an older question, the other [duplicate](https://ai.stackexchange.com/q/11979/2444) has an accepted answer and, overall, more upvoted answers, so they seem more useful; I have therefore marked this one as a duplicate of the newer question. – nbro Nov 16 '21 at 18:24

2 Answers

5

In short: MSE is convex when viewed as a function of the model's predictions (its immediate inputs). Composed with an arbitrary neural network, however, it is generally not convex in the network's parameters, because of the non-linearities introduced by the activation functions. Source for my answer is here.
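
A minimal numeric sketch of both halves of that claim (the toy one-hidden-unit $\tanh$ network and all names here are my own, not from the linked source): the convexity inequality always holds for MSE as a function of the prediction, but can fail for MSE as a function of the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# One training point: input 1.0, target 1.0.
x, y = np.array([1.0]), np.array([1.0])

def mse_of_prediction(z):
    """MSE as a function of the prediction z itself: a convex parabola."""
    return float(np.mean((z - y) ** 2))

def mse_of_params(theta):
    """MSE of a toy network f(x) = w2 * tanh(w1 * x), as a function of (w1, w2)."""
    w1, w2 = theta
    return float(np.mean((w2 * np.tanh(w1 * x) - y) ** 2))

def violates_convexity(f, a, b, lam=0.5, tol=1e-9):
    """True if f(lam*a + (1-lam)*b) > lam*f(a) + (1-lam)*f(b)."""
    return f(lam * a + (1 - lam) * b) > lam * f(a) + (1 - lam) * f(b) + tol

# In prediction space the inequality never fails:
assert not any(
    violates_convexity(mse_of_prediction, rng.normal(), rng.normal())
    for _ in range(10_000)
)

# In parameter space it can fail: both points below fit the target well,
# but their midpoint (0, 0) predicts 0 and has a strictly larger loss.
a, b = np.array([2.0, 2.0]), np.array([-2.0, -2.0])
print(violates_convexity(mse_of_params, a, b))  # -> True
```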

varsh
3

Convexity

A function $f(x)$ with $x \in X$ is convex if, for any $x_1, x_2 \in X$ and any $0 \le \lambda \le 1$, $$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$

It can be proven that, for such a convex $f(x)$, every local minimum is a global minimum (and the minimizer is unique if $f$ is strictly convex). This eliminates the traps posed by spurious local minima in algorithms that try to converge to the global minimum of an error function.
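
To see why, suppose $x^*$ is a local minimum of a convex $f$, yet some $y$ satisfies $f(y) < f(x^*)$. Then for every small $\lambda > 0$,

$$f\big((1 - \lambda) x^* + \lambda y\big) \le (1 - \lambda) f(x^*) + \lambda f(y) < f(x^*),$$

so points arbitrarily close to $x^*$ take strictly smaller values, contradicting local minimality. Hence every local minimum of a convex function is global.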

Although an error function may be perfectly reliable in all continuous, linear contexts and in many non-linear ones, this does not guarantee convergence to a global minimum in every possible non-linear context.

Mean Square Error

Given a function $s(x)$ describing ideal system behavior, a model of the system $a(x, p)$ (where $p$ is the parameter vector, matrix, cube, or hypercube), created rationally or via convergence (as in neural net training), and $N$ training samples $x_n$ with $1 \le n \le N$, the mean square error (MSE) function can be represented as follows.

$$e(p) := N^{-1} \sum_{n=1}^{N} \big[a(x_n, p) - s(x_n)\big]^2$$

The material you are reading is probably not claiming that $a(x, p)$ or $s(x)$ is convex with respect to $x$, but that $e(p)$ is convex with respect to the predictions $a(x_n, p)$ and the targets $s(x_n)$, no matter what $a$ and $s$ are. This latter statement can be proven for any continuous $a(x, p)$ and $s(x)$.
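
A quick way to see this, using the symbols above: hold the samples fixed and regard the error as a function of the prediction vector $z$ with components $z_n = a(x_n, p)$. Then

$$e(z) = N^{-1} \sum_{n=1}^{N} \big[z_n - s(x_n)\big]^2, \qquad \nabla^2 e(z) = \frac{2}{N} I \succeq 0,$$

so $e$ is a convex quadratic in the predictions no matter how $a$ and $s$ were produced. Convexity in the parameters $p$ is guaranteed only when $z$ depends affinely on $p$ (as in linear regression); non-linear activations break that guarantee.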

Confounding the Convergence Algorithm

If the question is whether a specific $a(x, p)$, together with a specific method of making it approximate $s(x)$ within a reasonable MSE convergence margin, can be confounded, the answer is, "Yes." That is why MSE is not the only error model.
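
As a concrete illustration (a toy example of mine, not drawn from any particular text), gradient descent on the MSE of a single-frequency model $a(x, w) = \sin(wx)$ fitting $s(x) = \sin(3x)$ succeeds or stalls depending on where it starts:

```python
import numpy as np

# Ideal system s(x) = sin(3x); model a(x, w) = sin(w * x), one parameter w.
xs = np.linspace(0.0, np.pi, 200)
target = np.sin(3.0 * xs)

def mse(w):
    return float(np.mean((np.sin(w * xs) - target) ** 2))

def grad(w, h=1e-5):
    # Central finite-difference estimate of de/dw.
    return (mse(w + h) - mse(w - h)) / (2.0 * h)

def descend(w, lr=0.05, steps=5000):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

for w0 in (2.5, 6.0):
    w = descend(w0)
    print(f"start {w0}: w = {w:.3f}, MSE = {mse(w):.4f}")

# Starting from 2.5, descent reaches the global minimum w = 3 (MSE ~ 0);
# starting from 6.0, it gets trapped in a local minimum near w ~ 5.5.
```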

Summary

The best way to summarize is that $e(p)$ should be defined or chosen from a set of stock convex error models based on the following knowledge.

  • Known properties of the system $s(x)$
  • The definition of the approximation model $a(x, p)$
  • The update rule (e.g., the gradient tensor) used to generate the next state in the convergent sequence

The set of stock convex error models certainly includes the MSE model because of its simplicity and computational thrift.

Douglas Daseeco