
Why do non-linear activation functions that produce values larger than 1 or smaller than 0 work?

My understanding is that neurons can only produce values between 0 and 1, and that this assumption can be used in things like cross-entropy. Are my assumptions just completely wrong?

Is there any reference that explains this?

nbro

2 Answers


Why wouldn't they work?

Each neuron's output is a function applied to the weighted sum of its inputs (the previous layer's outputs multiplied by the corresponding weights). If that function is the sigmoid, the output is squashed into the range $[0,1]$. If the entire layer uses a softmax function, then the outputs of all its neurons lie in $[0,1]$ and their sum equals 1. In other words, they represent a set of probabilities, which you can then optimize with cross-entropy (cross-entropy measures the difference between two probability distributions).
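
To make that relationship concrete, here is a minimal NumPy sketch (my own illustration with made-up logits and a one-hot target, not code from the post): the softmax turns arbitrary real-valued scores into a probability vector, which is exactly the kind of input cross-entropy expects.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the probabilities are unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(p_true, p_pred):
    # Difference between the target distribution and the predicted one.
    return -np.sum(p_true * np.log(p_pred + 1e-12))

logits = np.array([2.0, -1.0, 0.5])   # hypothetical raw outputs of the last layer
probs = softmax(logits)               # each value in [0, 1], and probs.sum() == 1
target = np.array([1.0, 0.0, 0.0])    # one-hot label
print(probs, probs.sum(), cross_entropy(target, probs))
```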

ReLU and ELU are simply other types of activation functions, whose outputs are not limited to the range $[0, 1]$. They are differentiable almost everywhere, like other activation functions, and so they can be used in any neural network.
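
For example (a small sketch of my own, not from the answer), ReLU and ELU happily produce values above 1 or below 0; that is fine for hidden layers, and a softmax output layer can still be placed on top whenever a probability vector is needed for cross-entropy.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, 0.5, 3.0])
print(relu(x))  # [0.  0.5 3. ]  -- not restricted to [0, 1]
print(elu(x))   # values range over (-alpha, inf)
```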

BlueMoon93
  • I think the OP was confused by the fact that the cross-entropy _may_ require a probability vector as input, so, by using activation functions that do not have values in the range $[0, 1]$, the output of the neural network may not be a probability vector. You partially address this by mentioning the softmax, but I think this answer could definitely be improved by explaining in more detail the cross-entropy and its relation to the softmax and to the other activation functions of the hidden neurons. – nbro Jan 24 '21 at 22:58

Christopher Olah's blog post describes it better than I ever could. Basically, most data we come across can't be separated with a single straight line, only with some kind of curve. Non-linearities allow us to distort the input space in ways that make the data linearly separable, making classification more accurate.
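
As a tiny, self-contained illustration (my own toy example with hand-picked weights, not taken from Olah's post): the XOR points cannot be separated by a line in their original 2-D space, but one ReLU hidden layer maps them to a space where a single linear threshold works.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR labels

# Hand-picked hidden layer (hypothetical weights) followed by a ReLU.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, -1.0])
H = np.maximum(0.0, X @ W + b)          # non-linear hidden representation

# In the hidden space, h1 - 2*h2 is 0 for class 0 and 1 for class 1,
# so a plain linear threshold now separates the classes.
pred = (H @ np.array([1.0, -2.0]) > 0.5).astype(int)
print(H)
print(pred, (pred == y).all())
```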

Daniel
  • This does not really answer the question. This answers another question, which is "Why do we need non-linearities?". Please, read the **actual** question again. – nbro Jan 24 '21 at 22:43