Questions tagged [relu]

For questions about the rectified linear unit (ReLU), or rectifier, a widely used activation function in neural networks.

49 questions
22 votes • 1 answer

What are the advantages of ReLU vs Leaky ReLU and Parametric ReLU (if any)?

I think that the advantage of using Leaky ReLU instead of ReLU is that this way we cannot have a vanishing gradient. Parametric ReLU has the same advantage, with the only difference being that the slope of the output for negative inputs is a learnable…
gvgramazio • 696 • 2 • 7 • 19
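For reference, a minimal NumPy sketch of the three variants this question compares; the function names and the `alpha` slope parameter are illustrative, not tied to any particular library.

```python
import numpy as np

def relu(x):
    # Zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small fixed slope alpha for negative inputs keeps a nonzero gradient there.
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # Same form as Leaky ReLU, but alpha is a learnable parameter (per channel or shared).
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))               # [0., 0., 0., 1.5]
print(leaky_relu(x))         # [-0.02, -0.005, 0., 1.5]
print(prelu(x, alpha=0.25))  # [-0.5, -0.125, 0., 1.5]
```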
13 votes • 1 answer

How exactly can ReLUs approximate non-linear and curved functions?

Currently, the most commonly used activation functions are ReLUs. So I answered this question: What is the purpose of an activation function in neural networks? While writing the answer, it struck me: how exactly can ReLUs approximate a…
user9947
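A rough sketch of the idea behind this question (my own illustration, not from the question): a weighted sum of shifted ReLUs is piecewise linear, and with enough kinks it can follow a smooth curve such as $x^2$ closely on a bounded interval.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Approximate f(x) = x^2 on [0, 1] with a piecewise-linear interpolant built from ReLUs.
knots = np.linspace(0.0, 1.0, 11)   # breakpoints of the piecewise-linear fit
targets = knots ** 2

def pwl_from_relus(x):
    # Start from the first segment's line, then add a "kink" (shifted ReLU) at each knot
    # whose weight is the change in slope needed to hit the next target value.
    slopes = np.diff(targets) / np.diff(knots)
    y = targets[0] + slopes[0] * (x - knots[0])
    for k in range(1, len(slopes)):
        y += (slopes[k] - slopes[k - 1]) * relu(x - knots[k])
    return y

xs = np.linspace(0.0, 1.0, 1001)
print(np.max(np.abs(pwl_from_relus(xs) - xs ** 2)))  # max error ~2.5e-3 with 10 segments
```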
11 votes • 2 answers

Why do we prefer ReLU over linear activation functions?

The ReLU activation function is defined as $$y = \max(0, x)$$ and the linear activation function as $$y = x.$$ The ReLU nonlinearity just clips values less than 0 to 0 and passes everything else unchanged. Then why…
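A small sketch of the standard argument (my own, not from the question): stacking purely linear layers collapses to a single linear map, while a ReLU in between does not.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Two linear layers are equivalent to a single linear layer with weights W2 @ W1.
linear_stack = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(np.allclose(linear_stack, collapsed))   # True

# With a ReLU in between, no single matrix reproduces the map for all inputs.
relu_stack = W2 @ np.maximum(0.0, W1 @ x)
print(np.allclose(relu_stack, collapsed))     # generally False
```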
10 votes • 3 answers

Are ReLUs incapable of solving certain problems?

Background I've been interested in and reading about neural networks for several years, but I haven't gotten around to testing them out until recently. Both for fun and to increase my understanding, I tried to write a class library from scratch in…
9 votes • 1 answer

What happens when I mix activation functions?

There are several activation functions, such as ReLU, sigmoid, or $\tanh$. What happens when I mix activation functions? I recently found that Google has developed the Swish activation function, which is $x \cdot \operatorname{sigmoid}(x)$. By altering the activation function, can it…
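For concreteness, a minimal sketch of the Swish activation mentioned here, $x \cdot \operatorname{sigmoid}(\beta x)$ with the common default $\beta = 1$ (also known as SiLU):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); with beta = 1 this is also known as SiLU.
    return x * sigmoid(beta * x)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(swish(x))  # smooth, non-monotonic near 0, and close to ReLU for large |x|
```

Unlike ReLU, Swish is smooth everywhere, which is part of why it is often discussed as a drop-in replacement.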
5 votes • 2 answers

Why is tf.abs non-differentiable in TensorFlow?

I understand why tf.abs is non-differentiable in principle (its derivative is discontinuous at 0), but the same applies to tf.nn.relu; yet, in the case of that function, the gradient is simply set to 0 at 0. Why is the same logic not applied to tf.abs? Whenever I tried to use…
zedsdead • 53 • 3
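Both $|x|$ and $\operatorname{ReLU}(x)$ are continuous but have a kink at 0, so a framework only has to pick some value from the subdifferential there. A plain-NumPy sketch of that convention (an illustration, not TensorFlow's actual gradient kernels, which may differ):

```python
import numpy as np

def relu_grad(x):
    # Derivative of max(0, x): 1 for x > 0, 0 for x < 0; pick 0 at the kink x == 0.
    return (x > 0).astype(float)

def abs_grad(x):
    # Derivative of |x|: sign(x) away from 0; np.sign picks 0 at the kink x == 0.
    return np.sign(x)

x = np.array([-2.0, 0.0, 3.0])
print(relu_grad(x))  # [0. 0. 1.]
print(abs_grad(x))   # [-1.  0.  1.]
```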
5 votes • 2 answers

In deep learning, is it possible to use discontinuous activation functions?

In deep learning, is it possible to use discontinuous activation functions (e.g. one with a jump discontinuity)? (My guess: for example, ReLU is non-differentiable at a single point, but it still has a well-defined derivative almost everywhere. If an activation…
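As a small illustration of why a genuine jump discontinuity is harder to train through than ReLU's single kink (my own sketch, not from the question): a step activation has zero derivative wherever it is differentiable, so backpropagation receives no signal through it.

```python
import numpy as np

def step(x):
    # Heaviside step: a jump discontinuity at 0.
    return (x >= 0).astype(float)

def step_grad(x):
    # Zero wherever the step is differentiable, so backprop gets no signal;
    # ReLU, by contrast, is continuous and has gradient 1 on the whole positive half-line.
    return np.zeros_like(x)

x = np.array([-1.0, 0.5, 2.0])
print(step(x), step_grad(x))  # [0. 1. 1.] [0. 0. 0.]
```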
4 votes • 1 answer

Why should one ever use ReLU instead of PReLU?

To me, it seems that PReLU is strictly better than ReLU. It does not have the dying ReLU problem, it allows negative values, and it has trainable parameters (which are computationally negligible to adjust). Only if we want the network to output…
4 votes • 0 answers

Should batch normalisation be applied before or after ReLU?

I know that there has been some discussion about this (e.g. here and here), but I can't seem to find consensus. The crucial thing that I haven't seen mentioned in these discussions is that applying batch normalization before ReLU switches off half…
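For reference, the two orderings being compared, written as hypothetical Keras blocks (layer sizes are arbitrary placeholders):

```python
import tensorflow as tf

# Ordering A: Dense -> BatchNorm -> ReLU (normalize pre-activations).
bn_before_relu = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(128, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.ReLU(),
])

# Ordering B: Dense -> ReLU -> BatchNorm (normalize post-activations).
bn_after_relu = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(128),
    tf.keras.layers.ReLU(),
    tf.keras.layers.BatchNormalization(),
])
```

In ordering A, `use_bias=False` is a common choice because BatchNormalization's own learned shift makes the Dense bias redundant.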
4 votes • 1 answer

Neural network doesn't seem to converge with ReLU but it does with Sigmoid?

I'm not really sure if this is the sort of question to ask here, since it is less a general question about AI and more about the coding of it; however, I thought it wouldn't fit on Stack Overflow. I have been programming a multilayer perceptron…
4 votes • 2 answers

Is PReLU superfluous with respect to ReLU?

Why do people use the PReLU activation? $\operatorname{PReLU}[x] = \operatorname{ReLU}[x] + \operatorname{ReLU}[p \cdot x]$, with the parameter $p$ typically being a small negative number. If a fully connected layer is followed by a $\operatorname{ReLU}$ layer with at least two elements, then the combined layers…
3 votes • 1 answer

Can residual neural networks use activation functions other than ReLU?

In many diagrams, as seen below, residual neural networks are only depicted with ReLU activation functions, but can residual NNs also use other activation functions, such as the sigmoid, hyperbolic tangent, etc.?
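A minimal sketch (hypothetical, not from any particular paper) of a residual block in which the activation is a parameter, so $\tanh$ or sigmoid can be swapped in for ReLU:

```python
import tensorflow as tf

def residual_block(x, units, activation="tanh"):
    # Two dense layers with a chosen activation, plus the identity shortcut.
    shortcut = x
    y = tf.keras.layers.Dense(units, activation=activation)(x)
    y = tf.keras.layers.Dense(units)(y)
    y = tf.keras.layers.Add()([shortcut, y])
    return tf.keras.layers.Activation(activation)(y)

inputs = tf.keras.Input(shape=(32,))
h = tf.keras.layers.Dense(32)(inputs)          # project to the block width
outputs = residual_block(h, units=32, activation="tanh")
model = tf.keras.Model(inputs, outputs)
```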
3 votes • 1 answer

How are exploding numbers in a forward pass of a CNN combated?

Take AlexNet, for example: in this case, only the ReLU activation function is used. Because ReLU cannot saturate, activations can instead explode, as in the following example: say I have a weight matrix of [-1,-2,3,4] and inputs of [ReLU(4),…
Recessive • 1,346 • 8 • 21
3 votes • 1 answer

How does backpropagation with unbounded activation functions such as ReLU work?

I am in the process of writing my own basic machine learning library in Python as an exercise to gain a good conceptual understanding. I have successfully implemented backpropagation for activation functions such as $\tanh$ and the sigmoid function.…
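A minimal plain-NumPy sketch (with made-up helper names) of the backward pass through one dense layer with ReLU; the point for unbounded activations is that backpropagation only needs the local derivative (here 0 or 1), not a bounded output range:

```python
import numpy as np

def dense_relu_forward(x, W, b):
    z = W @ x + b                 # pre-activation
    a = np.maximum(0.0, z)        # ReLU
    return a, z

def dense_relu_backward(x, W, z, grad_a):
    # Chain rule: d(loss)/dz = d(loss)/da * relu'(z), with relu'(z) = 1[z > 0].
    grad_z = grad_a * (z > 0)
    grad_W = np.outer(grad_z, x)  # d(loss)/dW
    grad_b = grad_z               # d(loss)/db
    grad_x = W.T @ grad_z         # gradient passed to the previous layer
    return grad_W, grad_b, grad_x

rng = np.random.default_rng(0)
x, W, b = rng.normal(size=3), rng.normal(size=(4, 3)), np.zeros(4)
a, z = dense_relu_forward(x, W, b)
grads = dense_relu_backward(x, W, z, grad_a=np.ones(4))
```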
2 votes • 2 answers

Why do non-linear activation functions that produce values larger than 1 or smaller than 0 work?

Why do non-linear activation functions that produce values larger than 1 or smaller than 0 work? My understanding is that neurons can only produce values between 0 and 1, and that this assumption can be used in things like cross-entropy. Are my…
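A small sketch (my own, not from the question) of the usual resolution: hidden activations may be arbitrarily large, and only the output layer is squashed into $(0, 1)$ before a cross-entropy loss.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10)
W1 = rng.normal(size=(16, 10)) * 3.0   # deliberately large weights
w2 = rng.normal(size=16) * 0.01

h = np.maximum(0.0, W1 @ x)            # hidden ReLU activations, not confined to [0, 1]
p = 1.0 / (1.0 + np.exp(-(w2 @ h)))    # sigmoid output, strictly inside (0, 1)

y = 1.0                                # binary label
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # binary cross-entropy is still well defined
print(h.max(), p, loss)                # hidden values can be large; p stays in (0, 1), loss is finite
```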