I'd like to ask how we know that neural networks start by learning small, basic features or "parts" of the data and then use them to build up more complex features as we go through the layers. I've heard this a lot and seen it in videos like this 3Blue1Brown one on neural networks for digit recognition. It says that in the first layer the neurons learn to detect small edges, and then the neurons of the second layer learn more complex patterns like circles... But I can't figure out, based on pure maths, how this is possible.
2 Answers
We find out experimentally: you can inspect what each layer is learning by probing values throughout the network and running gradient ascent on the input. For more detail, watch this lecture: https://www.youtube.com/watch?v=6wcs6szJWMY&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv&index=12 — it covers many methods for understanding exactly what your model is doing at a given layer, and what features it has learned.
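To give a flavour of the gradient-ascent technique from the lecture, here is a minimal NumPy sketch (my own toy setup, not from the lecture): we ascend on the *input* to maximize one neuron's activation. For a linear neuron the optimal input aligns with its weight vector, which is exactly the "preferred stimulus" these visualizations surface.

```python
import numpy as np

# Toy activation maximization: gradient ascent on the input, not the weights.
# All names and numbers here are illustrative.
rng = np.random.default_rng(0)

w = rng.normal(size=16)          # weights of one "neuron" (linear for simplicity)
x = rng.normal(size=16) * 0.01   # start from a small random "image"

lr = 0.1
for _ in range(200):
    # activation = w @ x, so d(activation)/dx = w
    grad = w
    x = x + lr * grad
    x = x / np.linalg.norm(x)    # keep the input bounded, as done in practice

# The unit input maximizing a linear neuron's activation points along w,
# so the cosine similarity between x and w approaches 1.
cosine = x @ w / np.linalg.norm(w)
print(round(float(cosine), 3))
```

In a real network the gradient comes from backpropagation through many nonlinear layers rather than a closed form, but the loop is the same idea.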

Thank you for the lecture, it's so interesting. There are a couple of things I didn't get (like in the nearest-neighbours part, how do we map the images from the training set to the feature space of 4096-dimensional vectors?) but I got the general idea of each technique. Thank you again, it's a jewel! – Daviiid Feb 26 '21 at 06:11
The network architecture is relevant to this question.
Convolutional neural network architectures enforce the building up of features because neurons in the earlier layers have access to only a small number of input pixels. Neurons in deeper layers are connected (indirectly) to more and more pixels, so it makes sense that they identify larger and larger features. Many of the visual examples available online which show, for example, a curve, then a circle, then part of an animal, then a whole animal, are based on convolutional networks. The beautiful examples from the Stanford lecture in the other answer use convolutional networks.
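You can make "deeper neurons see more pixels" precise with the standard receptive-field recurrence. The sketch below (my own illustration, with made-up layer configurations) tracks how many input pixels one output neuron depends on after each conv layer:

```python
# Receptive-field growth with depth in a CNN.
# Standard recurrence: r_out = r_in + (k - 1) * j_in, and j_out = j_in * s,
# where r is the receptive field size, j is the feature map's stride ("jump")
# relative to the input, k is kernel size, and s is the layer's stride.

def receptive_fields(layers):
    """layers: list of (kernel_size, stride); returns r after each layer."""
    r, j = 1, 1
    sizes = []
    for k, s in layers:
        r = r + (k - 1) * j
        j = j * s
        sizes.append(r)
    return sizes

# Three 3x3 convs, stride 1: neurons see 3, then 5, then 7 input pixels.
print(receptive_fields([(3, 1), (3, 1), (3, 1)]))   # [3, 5, 7]
# With stride-2 downsampling the visible region grows much faster.
print(receptive_fields([(3, 2), (3, 2), (3, 2)]))   # [3, 7, 15]
```

So a first-layer neuron can at best detect a tiny edge, while a neuron a few layers down covers a region large enough to contain a circle or an object part.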
That being said, increasing complexity with each layer holds generally, including for dense architectures like the one in the 3Blue1Brown video. It's just that this is a more abstract 'increase in nonlinearity' rather than an increase in spatial feature size. Depending on the task the network is learning, earlier layers will be more 'basic', but their neurons might use large areas of the input.

Oh I see, thank you. So may we say there is a trade-off between the size of the input areas that the first layers' neurons use and complexity? Like, convolutional neural networks work better than dense neural networks on images because the first neurons can learn complex detail by looking at small areas of the input, while in a dense fully connected network the neurons look at everything. And are there other interpretations for the other types of architectures, or do we just say that they find and learn small patterns in the data, regardless of what it is, by analogy to CNNs? – Daviiid Feb 26 '21 at 22:38
Hi @Daviiid. I would not say that there is a trade-off as you describe. There are theorems showing that deeper layers can learn more complex features; for example, a single layer cannot learn XOR but deeper networks can. However, convolutional layers do not learn more complex features than dense layers in that sense. I would say instead that convolutional layers work better because they follow the symmetry of the task: we effectively apply some assumed knowledge we have about the task through them. This allows training to be faster and more effective, but it doesn't really achieve more complexity. – user7834 Apr 12 '21 at 20:26
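The XOR point in the comment above can be made concrete. No single linear threshold unit can compute XOR (its decision boundary is a line, and the four XOR points are not linearly separable), but one hidden layer suffices. A small sketch with hand-picked weights (chosen for illustration, not learned):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def xor_net(x1, x2):
    # Two hidden ReLU units built from the raw inputs:
    h1 = relu(x1 + x2)        # counts how many inputs are active (0, 1, or 2)
    h2 = relu(x1 + x2 - 1.0)  # fires only when both inputs are active
    # Subtracting twice the "both on" unit cancels the (1, 1) case:
    return h1 - 2.0 * h2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_net(a, b)))   # prints the XOR truth table
```

The second layer combines first-layer features (here "at least one on" and "both on") into something neither feature expresses alone, which is the depth-builds-complexity argument in miniature.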
I see, thank you for your answer, I understand better. I guess it's like with recurrent neural networks: we have the knowledge that time-series data depends on past data, so we work with networks that preserve this characteristic structure of the data. – Daviiid Apr 14 '21 at 02:22