
Neural networks are commonly used for classification tasks; in fact, from this post it seems like that's where they shine brightest.

However, when we want to classify using neural networks, we often constrain the output layer to take values in $[0,1]$; typically, by taking the activation of the last layer to be the sigmoid function $x \mapsto \frac{e^x}{e^x +1}$.
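For concreteness, here is a minimal, hypothetical NumPy sketch (the layer sizes and random weights below are made up purely for illustration) of a network whose final unit applies the sigmoid, so its output always lies in $(0, 1)$:

```python
import numpy as np

def sigmoid(x):
    # e^x / (e^x + 1) is the same function as 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Hypothetical tiny network: one hidden layer, sigmoid on the single output unit
W1, b1 = rng.normal(size=(16, 4)), np.zeros(16)   # hidden layer: 16 units, 4 inputs
w2, b2 = rng.normal(size=16), 0.0                 # output unit

def forward(x):
    h = np.tanh(W1 @ x + b1)          # any hidden activation works here
    return sigmoid(w2 @ h + b2)       # final sigmoid squashes the output into (0, 1)

x = rng.normal(size=4)
print(forward(x))                     # always strictly between 0 and 1
```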

Can neural networks with a sigmoid as the activation function of the output layer approximate continuous functions? Is there an analogue to the universal approximation theorem for this case?

ABIM
    [Why sigmoid function instead of anything else?](https://stats.stackexchange.com/q/162988/82135) –  Mar 23 '20 at 16:13
  • @DuttaA That question has some interesting answers. But note that this question is more focused and specific. The other question is about the _general_ use of sigmoids for neural networks and logistic regression. – nbro Mar 23 '20 at 16:21
  • @nbro I have seen some books mention that there might indeed be some justification for using exponentials; I am not exactly sure how (I didn't check the maths behind it). –  Mar 23 '20 at 16:24
  • Why not polyfit? – Sergei Krivonos Mar 29 '20 at 01:16

1 Answer


As far as I know, the sigmoid is often used as the activation function of the output layer mainly because it is a convenient way of producing an output $p \in [0, 1]$, which can be interpreted as a probability, although that interpretation can be misleading or even wrong (especially if you also interpret it as a measure of uncertainty).

You may require the output of the neural network to be a probability, for example, if you use a cross-entropy loss function, although you could in principle produce only $0$s and $1$s. The probability $p$ can then be used to decide the class (or label) of the input. For example, if $p > \alpha$, then you decide that the input belongs to class $1$; otherwise, it belongs to class $0$. The parameter $\alpha$ is called the classification (or decision) threshold. The choice of this threshold depends on the problem, and it is one of the reasons people use the AUC metric, i.e. to avoid having to choose this classification threshold.
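As a small illustration (the threshold values and the probabilities below are arbitrary), turning the sigmoid output $p$ into a class label is just a comparison against $\alpha$:

```python
import numpy as np

def classify(p, alpha=0.5):
    """Map predicted probabilities p in [0, 1] to class labels using threshold alpha."""
    return (np.asarray(p) > alpha).astype(int)

p = np.array([0.10, 0.49, 0.51, 0.97])   # made-up sigmoid outputs
print(classify(p))                        # [0 0 1 1]
print(classify(p, alpha=0.9))             # stricter threshold -> [0 0 0 1]
```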

> Can neural networks with a sigmoid as the activation function of the output layer approximate continuous functions? Is there an analogue to the universal approximation theorem for this case?

The most famous universal approximation theorem for neural networks assumes that the activation functions of the units of the only hidden layer are sigmoids, but it does not assume that the output of the network will be squashed to the range $[0, 1]$. To be more precise, the UAT (theorem 2 of Approximation by Superpositions of a Sigmoidal Function, 1989, by G. Cybenko) states

> Let $\sigma$ be any continuous sigmoidal function. Then finite sums of the form
>
> $$G(x) = \sum_{j=1}^{N} \alpha_j \sigma(y_j^T x + \theta_j)$$
>
> are dense in $C(I_n)$. In other words, given any $f \in C(I_n)$ and $\epsilon > 0$, there is a sum, $G(x)$, of the above form, for which
>
> $$|G(x) - f(x)| < \epsilon \quad \text{for all } x \in I_n.$$

Here, $f$ is the continuous function that you want to approximate, $G(x)$ is a linear combination of the outputs of the $N$ (which should be arbitrarily big) units of the only hidden layer, $I_n$ denotes the $n$-dimensional unit cube $[0, 1]^n$, $C(I_n)$ denotes the space of continuous functions on $I_n$, $x \in I_n$ (so the assumption is that the input to the neural network is an element of $[0, 1]^n$, i.e. a vector $x \in \mathbb{R}^n$ whose entries are between $0$ and $1$), and $y_j$ and $\theta_j$ are respectively the weights and the bias of the $j$th unit. The assumption that $f$ is a real-valued function means that $f$ can take any value in $\mathbb{R}$ (i.e. $f: [0, 1]^n \rightarrow \mathbb{R}$).

You should note that $G(x)$ is the output of the neural network: a linear combination (with coefficients $\alpha_j$) of the outputs of the units in the only hidden layer. So there's no restriction on the output $G(x)$, unless you restrict the $\alpha_j$ (but, in this theorem, there's no restriction on the values the $\alpha_j$ can take).
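This is not Cybenko's (non-constructive) argument, but here is a minimal NumPy sketch of a sum of exactly the above form: the inner weights $y_j$ and biases $\theta_j$ are drawn at random, and only the coefficients $\alpha_j$ are fit by least squares to a made-up target $f$ on $I_1 = [0, 1]$. The target function, the number of units, and the scales of the random weights are all arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Target: a continuous function f on the unit interval I_1 = [0, 1]
f = lambda x: np.sin(2 * np.pi * x)

N = 200                                   # number of hidden units
x = np.linspace(0, 1, 500)[:, None]       # points in I_1
y = rng.normal(scale=10.0, size=(N, 1))   # random inner weights y_j
theta = rng.uniform(-10.0, 10.0, size=N)  # random biases theta_j

Phi = sigmoid(x @ y.T + theta)            # Phi[i, j] = sigma(y_j^T x_i + theta_j)
alpha, *_ = np.linalg.lstsq(Phi, f(x).ravel(), rcond=None)  # fit the outer coefficients alpha_j

G = Phi @ alpha                           # G(x) = sum_j alpha_j * sigma(y_j^T x + theta_j)
print(np.max(np.abs(G - f(x).ravel())))   # sup-norm error; shrinks as N grows
```

Note that $G$ itself is unbounded in general, since nothing squashes the linear combination at the output.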

Of course, if you restrict the output of the neural network to the range $[0, 1]$, you cannot approximate all continuous functions of the form $f: [0, 1]^n \rightarrow \mathbb{R}$ (because not all of these functions have a range contained in $[0, 1]$)! However, the sigmoid has an inverse function, the logit, so you can reverse the output of such a neural network. So, in this sense (i.e. by reversing the output of the sigmoid), a neural network with a sigmoid as the activation function of the output layer can potentially approximate any continuous function too.
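A small sketch of that inversion (the values below are arbitrary): if the network outputs $\sigma(G(x)) \in (0, 1)$, applying the logit recovers $G(x)$, so nothing is lost by the final squashing, up to floating-point precision and the fact that the sigmoid never reaches exactly $0$ or $1$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # inverse of the sigmoid on (0, 1): log(p / (1 - p))
    return np.log(p) - np.log1p(-p)

g = np.array([-3.2, 0.0, 1.7, 8.0])   # made-up real-valued outputs G(x)
p = sigmoid(g)                        # squashed to (0, 1) by the output sigmoid
print(np.allclose(logit(p), g))       # True: the original values are recovered
```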

The UAT above only states the existence of $G(x)$ (i.e. it is an existence theorem); it doesn't tell you how to find $G(x)$. So, whether or not you use a sigmoid as the activation function of the output layer is a little bit orthogonal to the universality of neural networks.

nbro
    Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackexchange.com/rooms/105895/discussion-on-answer-by-nbro-is-the-use-of-the-sigmoid-as-the-activation-of-the). – nbro Mar 23 '20 at 22:45
  • So basically, people infer it from the universal approximation theorem but there is no actual paper... – ABIM Mar 28 '20 at 08:00
  • @AIM_BLB I don't know if there's a paper or not that formally shows that, if the output of the neural network is a sigmoid, then the NN is able to approximate a class X of functions, but if you have $\sigma(G(x))$, you can retrieve $G(x)$ by applying the inverse of $\sigma$, i.e. $\sigma^{-1}(\sigma(G(x))) = G(x)$. This is more what I meant. – nbro Mar 28 '20 at 12:34
  • @nbro have you ever come across a book or paper vaguely stating this argument? I've been looking, but I can't find anything... – ABIM Apr 16 '20 at 23:32
  • @AnnieTheKatsu Stating what exactly? – nbro Apr 16 '20 at 23:38
  • I came across this post and saw your above comment. It seems like your argument is "common knowledge", but I can't find it written down anywhere... – ABIM Apr 16 '20 at 23:42
  • @AnnieTheKatsu If you are talking about the comment above, the point of that comment is that you can reverse the output of a sigmoid, so, **in this sense**, a net with a sigmoid at the output still follows the UAT. – nbro Apr 16 '20 at 23:45
  • @AnnieTheKatsu Maybe it helps if you think about it in this other way. Suppose you have a net without a sigmoid at the output. Now, provided the conditions to achieve universality hold (e.g. a sufficient number of neurons, etc.), this network can approximate any continuous function. Now, you have this network that can approximate any continuous function, then you apply the sigmoid to it. This will squash the original output to $[0, 1]$, but you can reverse the output of the sigmoid again to obtain the original output. – nbro Apr 16 '20 at 23:49
  • I know, but the issue is I'm looking for a source I can cite for this (to show traction in the paper I'm writing) – ABIM Apr 16 '20 at 23:50
  • @AnnieTheKatsu I think I recently came across a paper that goes into the details of networks with probabilities as output, but I don't remember which paper was that. – nbro Apr 16 '20 at 23:50
  • If you could find that, it would be fantastic... It looks to me that this is "folklore", which makes it hard to pin down... – ABIM Apr 16 '20 at 23:51
  • @AnnieTheKatsu Actually, I think the paper I am referring to is the paper by Cybenko that I am citing in my answer, i.e. "Approximation by Superpositions of a Sigmoidal Function". Although I didn't read that part fully, if I recall correctly, it seemed to me that Cybenko goes in the direction of "networks that classify". Have a look at it. I may be wrong, but the part "We now demonstrate the implications of these results in the context of decision regions" could be useful. Let me know. – nbro Apr 16 '20 at 23:53
  • I looked; the closest is Theorem 3, but still, it doesn't use a soft-max/logistic output layer, it only appeals to Lusin's theorem... – ABIM Apr 17 '20 at 00:54
  • @AnnieTheKatsu I would need to think about it and read that part more carefully. But I think we are not asking the right question! What do we want to show? Again, note that I am not saying in my answer that a neural network with a sigmoid as output can approximate any continuous function. In fact, I am more saying the opposite. It cannot because not all functions have the range $[0, 1]$. But, as I said, this range can be converted to the reals and vice-versa. – nbro Apr 17 '20 at 01:07
  • Maybe you can extend the UAT to this case by assuming that both $G(x)$ and $f(x)$ can be first converted to a sigmoid, and then maybe you can show that the UAT still applies with this _change of variable_. – nbro Apr 17 '20 at 02:49