
Why do non-linear activation functions that produce values larger than 1 or smaller than 0 work?

My understanding is that neurons can only produce values between 0 and 1, and that this assumption can be used in things like cross-entropy. Are my assumptions just completely wrong?

Is there any reference that explains this?

nbro

2 Answers


Why wouldn't they work?

Each neuron's output is a function applied to the weighted sum of its inputs (the previous layer's outputs multiplied by the corresponding weights). If that function is the sigmoid, the output is squashed into the range $[0,1]$. If the entire layer uses a softmax function, then the outputs of all its neurons lie in $[0,1]$ and their sum equals 1. In other words, they represent a set of probabilities, which you can then optimize with cross-entropy (cross-entropy measures the difference between two probability distributions).
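
To make that relationship concrete, here is a minimal NumPy sketch (my own illustration with made-up logits and a one-hot target, not code from the post): the softmax turns arbitrary real-valued scores into a probability vector, which is exactly the kind of input cross-entropy expects.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the probabilities are unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(p_true, p_pred):
    # Difference between the target distribution and the predicted one.
    return -np.sum(p_true * np.log(p_pred + 1e-12))

logits = np.array([2.0, -1.0, 0.5])   # hypothetical raw outputs of the last layer
probs = softmax(logits)               # each value in [0, 1], and probs.sum() == 1
target = np.array([1.0, 0.0, 0.0])    # one-hot label
print(probs, probs.sum(), cross_entropy(target, probs))
```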

ReLU and ELU are simply other types of activation functions, whose outputs are not limited to the range $[0, 1]$. They are differentiable almost everywhere, like other activation functions, and so they can be used in any neural network.
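
For example (a small sketch of my own, not from the answer), ReLU and ELU happily produce values above 1 or below 0; that is fine for hidden layers, and a softmax output layer can still be placed on top whenever a probability vector is needed for cross-entropy.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, 0.5, 3.0])
print(relu(x))  # [0.  0.5 3. ]  -- not restricted to [0, 1]
print(elu(x))   # values range over (-alpha, inf)
```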

BlueMoon93
  • I think the OP was confused by the fact that the cross-entropy _may_ require a probability vector as input, so, by using activation functions that do not have values in the range $[0, 1]$, the output of the neural network may not be a probability vector. You partially address this by mentioning the softmax, but I think this answer could definitely be improved by explaining in more detail the cross-entropy and its relation to the softmax and to the other activation functions of the hidden neurons. – nbro Jan 24 '21 at 22:58

Christopher Olah's blog post describes it better than I ever could. Basically, most data we come across can't be separated with a single straight line, only with some kind of curve. Non-linearities allow us to distort the input space in ways that make the data linearly separable, making classification more accurate.
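
As a tiny, self-contained illustration (my own toy example with hand-picked weights, not taken from Olah's post): the XOR points cannot be separated by a line in their original 2-D space, but one ReLU hidden layer maps them to a space where a single linear threshold works.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR labels

# Hand-picked hidden layer (hypothetical weights) followed by a ReLU.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, -1.0])
H = np.maximum(0.0, X @ W + b)          # non-linear hidden representation

# In the hidden space, h1 - 2*h2 is 0 for class 0 and 1 for class 1,
# so a plain linear threshold now separates the classes.
pred = (H @ np.array([1.0, -2.0]) > 0.5).astype(int)
print(H)
print(pred, (pred == y).all())
```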

Daniel
  • This does not really answer the question. This answers another question, which is "Why do we need non-linearities?". Please, read the **actual** question again. – nbro Jan 24 '21 at 22:43