
From Deep Learning (Goodfellow, Bengio, Courville), a ReLU activation often "dies" because

One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero.
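To make that concrete, here is a minimal NumPy sketch (my own toy numbers, not from the book) of a unit whose bias has been pushed so far negative that its ReLU gradient is zero on every example, so gradient descent can never revive it:

```python
import numpy as np

# Toy illustration: a single ReLU unit with a very negative bias. Its
# pre-activation is negative for every input, so the ReLU derivative is zero
# everywhere and the incoming weights receive no gradient at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # 100 examples, 3 input features
w = np.array([0.1, -0.2, 0.05])      # incoming weights (made-up values)
b = -10.0                            # large negative bias -> "dead" unit

z = X @ w + b                        # pre-activation, negative on all rows
a = np.maximum(z, 0.0)               # ReLU output: all zeros
relu_grad = (z > 0).astype(float)    # dReLU/dz: zero wherever z <= 0

# Whatever gradient flows back from the loss gets multiplied by relu_grad,
# so the gradients w.r.t. w and b vanish on every example.
upstream = rng.normal(size=100)      # arbitrary upstream gradient
grad_w = X.T @ (upstream * relu_grad) / len(X)
grad_b = np.mean(upstream * relu_grad)
print(a.max(), grad_w, grad_b)       # 0.0 and all-zero gradients
```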

Similarly, L1 regularization (as opposed to L2) results in a sparse network:

This demonstrates that L2 regularization does not cause the parameters to become sparse, while L1 regularization may do so for large enough α. The sparsity property induced by L1 regularization has been used extensively as a feature selection mechanism.
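To check my understanding of that passage, here is a small sketch (using scikit-learn's Lasso and Ridge on made-up data, not anything from the book) where L1 drives most weights exactly to zero while L2 only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Made-up regression problem: only 3 of 20 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 1.0]
y = X @ true_w + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty; alpha plays the role of the book's α
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

# L1 typically zeroes out most of the 17 irrelevant weights (a crude form of
# feature selection); L2 shrinks them toward zero but leaves them nonzero.
print("L1 zero weights:", int(np.sum(lasso.coef_ == 0)))
print("L2 zero weights:", int(np.sum(ridge.coef_ == 0)))
```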

A couple of questions about these topics:

  1. In practice, is there any way to prune these "dead" ReLU-activated neurons, or any use in doing so? And if our trained network performs well with lots of dead neurons, would that imply that a shallower network is a sufficient representation? (A rough sketch of how dead units could be detected follows after these questions.)
  2. Since ReLU activations also result in a sparse network, do they have the same "feature selection" property as L1 regularization? If they do, does that imply that sigmoid/tanh activations don't have this property?
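For question 1, this is roughly what I mean by detecting dead units: a sketch over a hypothetical toy PyTorch model (the architecture and the assumption that the ReLU sits at index 1 are mine, just for illustration):

```python
import torch
import torch.nn as nn

def count_dead_units(model: nn.Sequential, data: torch.Tensor) -> int:
    """Count hidden units whose ReLU output is zero on every example."""
    activations = {}

    def hook(module, inputs, output):
        activations["relu"] = output

    # Assumes the ReLU is the layer at index 1 of the toy model below.
    handle = model[1].register_forward_hook(hook)
    with torch.no_grad():
        model(data)
    handle.remove()

    # A unit is "dead" on this dataset if it never activates.
    dead_mask = (activations["relu"] == 0).all(dim=0)
    return int(dead_mask.sum().item())

# Toy model and data; an untrained network will usually report 0 dead units.
# The point is only the check itself (such units could in principle be pruned).
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
data = torch.randn(256, 10)
print(count_dead_units(model, data))
```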
  • Please, select one question and remove the others, which you can ask in a separate post! – nbro Jan 21 '23 at 12:01

0 Answers