
Why can we not parametrize and learn the non-linear activations? For example, consider the leaky ReLU, which equals $f(y)=y$ for $y>0$ and $f(y)=\alpha y$ for $y<0$. It seems that we can differentiate the parameter $\alpha$ with respect to the loss and learn it, so why is this not done?

David Hoelzer
Gilad Deutsch
    Saying "differentiate the parameters wrt the loss" makes no sense to me. What makes sense is to differentiate a function (e.g. the loss function) with respect to some parameter. – nbro May 03 '23 at 23:39

1 Answer


The ReLU is the simplest nonlinear function that has shown remarkable performance when used as an activation function in neural networks. Note that its derivative is binary, either zero or one, depending only on the sign of the input. This makes ReLU very fast and convenient to use.
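For instance, a quick autograd check (a minimal sketch in PyTorch; the specific input values are just illustrative) confirms that the gradient is 1 for positive inputs and 0 for negative ones:

```python
import torch

# Gradient of ReLU is 1 for positive inputs and 0 for negative ones.
x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
y = torch.relu(x).sum()
y.backward()
print(x.grad)  # tensor([0., 0., 1., 1.])
```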

Since its first use, many other ReLU-like functions have been proposed. For your specific case, PyTorch already implements the PReLU function, which has a learnable parameter $a$ exactly like the $\alpha$ in your formulation.
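A minimal sketch of how this might look in practice (the `nn.PReLU` module and its `num_parameters`/`init` arguments are part of the standard PyTorch API; the surrounding toy network, data, and optimizer are illustrative assumptions):

```python
import torch
import torch.nn as nn

# PReLU: f(y) = y for y >= 0, f(y) = a * y for y < 0, with a learnable.
# num_parameters=1 shares a single a across all channels; init sets its starting value.
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.PReLU(num_parameters=1, init=0.25),
    nn.Linear(32, 1),
)

# a is an ordinary parameter, so it receives gradients and is updated by the optimizer
# along with the weights (toy data below, purely for illustration).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, target = torch.randn(8, 10), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
optimizer.step()
print(model[1].weight)  # the learned a (slope for negative inputs)
```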

N. Kiefer