
I've read in F. Rosenblatt, *Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms*, that in the multilayer perceptron the activation functions in the second, third, ... layers are all non-linear, and they can all be different, while in the first layer they are all linear.

Why?

What does this depend on?

  1. When it is said that "the neural network learns automatically", what does this mean in colloquial terms?

AFAIK, one first trains the NN, and then at some point the NN learns. Where does the "automatically" come in, then?

Thanks in advance for your help.

  • Hello Veronica. Could you please focus on one question at a time (so that people can focus on one problem at a time)? I would suggest that you remove the second question and ask it in a separate post. Please also try to put your **specific** question in the title, to give readers an immediate idea of what your post is about. – nbro Apr 22 '21 at 01:03
  • @nbro Hello, moderator nbro. OK, but note that nxglogic has answered both questions; if I edit my question, that would affect their answer. I've modified the title to make it more specific. – Verónica Rmz. Apr 22 '21 at 16:00
  • Helpful: https://ai.stackexchange.com/questions/7088/how-to-choose-an-activation-function and its answers. – Verónica Rmz. Apr 25 '21 at 01:20

1 Answer

Rosenblatt was probably discussing one specific architecture, of which there are many. However, for general-purpose feed-forward back-propagation ANNs used for function approximation and classification, you can use whatever activation functions you want on the input side, in the hidden layers, and on the output side. Examples are the identity, logistic, tanh, exponential, Hermite, Laguerre, radial basis, ReLU, and softmax functions. "Automatically" likely refers to the iterative learning process, which is typically some form of gradient descent, during which the partial derivatives of the prediction error with respect to the coefficients (weights) shrink as the network converges.
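To make both points concrete, here is a minimal NumPy sketch (hypothetical layer sizes and toy data, not Rosenblatt's architecture) of a feed-forward network with a non-linear tanh hidden layer and an identity output layer, trained by plain gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (hypothetical): learn y = sin(pi * x) on [-1, 1]
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(np.pi * X)

# One hidden layer with tanh, identity on the output side
W1 = rng.normal(0, 0.5, size=(1, 8))
b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, size=(8, 1))
b2 = np.zeros(1)

lr = 0.1
for epoch in range(2000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)        # non-linear hidden layer
    y_hat = h @ W2 + b2             # identity (linear) output layer

    # Prediction error
    err = y_hat - y

    # Backward pass: partial derivatives of the error w.r.t. each weight
    dW2 = h.T @ err / len(X)
    db2 = err.mean(axis=0)
    dh = err @ W2.T * (1 - h**2)    # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dh / len(X)
    db1 = dh.mean(axis=0)

    # Gradient-descent update: this loop is the "automatic" learning
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final MSE:", float((err**2).mean()))
```

Swapping `np.tanh` for a logistic or ReLU function only changes the forward line and the matching derivative line; the update loop itself is untouched, which is why the per-layer activation choice is largely a free design parameter.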

  • Thank you for your help. – Verónica Rmz. Apr 22 '21 at 15:58
  • What does the choice of activation functions depend on? – Verónica Rmz. Apr 22 '21 at 15:59
  • You could take a look at the Chris Bishop book at https://www.academia.edu/35719617/_Christopher_M_Bishop_Neural_Networks_for_Patter_b_ok_org_ (just page down to see the pages). ANNs really do like inputs in the range [-1, 1], so feature standardization is a good way to start; either that, or normalization to the range [0, 1] followed by subtracting 0.5 from the input values. Given this, tanh or the logistic function would be my first choice at the hidden layer, then either the identity or the logistic function on the output side. Softmax is a good choice for classification on the output side. –  Apr 22 '21 at 22:34
  • You have to assess error (e.g., MSE, cross-entropy) based on which output functions you use. But you should also use cross-validation (see the sketch after these comments). –  Apr 22 '21 at 22:34
  • Is it common to use a single activation function for an entire layer? – Verónica Rmz. Apr 23 '21 at 18:32
  • I'm not sure about mixing different activation functions within the same layer for a straightforward ANN, but you would have to compute different partial derivatives, which could be a nightmare to program. (I don't use open-source software.) –  Apr 23 '21 at 22:38
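Putting the preprocessing and error-assessment comments above together, a minimal sketch (hypothetical feature matrix and split sizes) might look like:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(5.0, 3.0, size=(100, 4))   # hypothetical raw features

# Standardization: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Alternative: min-max normalization to [0, 1], then subtract 0.5
X_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) - 0.5

# Match the error measure to the output activation:
def mse(y_hat, y):                 # pair with an identity output (regression)
    return ((y_hat - y) ** 2).mean()

def cross_entropy(p, y_onehot):    # pair with a softmax output (classification)
    return -(y_onehot * np.log(p + 1e-12)).sum(axis=1).mean()

# Simple hold-out split as a stand-in for full cross-validation
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:80], idx[80:]
X_train, X_test = X_std[train_idx], X_std[test_idx]
```

The key point from the comments is the pairing: identity output with MSE for regression, softmax output with cross-entropy for classification, with held-out data used to assess whichever error measure you pick.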