
I'm trying to understand the VGG architecture, and I have the following questions.

  1. I understand the general reasoning for increasing the number of filters: because we use max pooling, the spatial size of the feature maps gets reduced, so, to preserve information, we increase the number of filters. But in the last few layers of the VGG architecture, the number of filters stayed the same at 512 while max pooling reduced the feature maps from 14x14 to 7x7. Why wasn't there a need to increase the number of filters there?
  2. Also, a few consecutive layers at the end were built with both the same number of filters and the same feature-map size. Were those layers added just to increase accuracy (experimentation)?
  3. And I couldn't wrap my head around the visualization in which the final filters have an entire face as the feature, as I understood from the convolution-visualization explanation (Matt Zeiler's video). But max pooling causes us to see only a subset of the image, right? With 512 filters (a face as the filter/feature) and a feature-map size of only 7x7, how does an entire face as a filter work on images when we are moving over such a small subset of the image pixels?

[Images: VGG architecture summary; VGG architecture final layers summary]
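
For reference, here is a minimal Keras sketch of the VGG-16 convolutional blocks being asked about (a simplification of the real architecture; the `conv_block` helper and layer arrangement are illustrative):

```python
from tensorflow.keras import layers, Sequential

def conv_block(filters, n_convs):
    """n_convs 3x3 convolutions followed by a 2x2 max pool."""
    block = [layers.Conv2D(filters, 3, padding="same", activation="relu")
             for _ in range(n_convs)]
    block.append(layers.MaxPooling2D(2))  # halves the spatial size
    return block

vgg16_features = Sequential(
    [layers.Input((224, 224, 3))]
    + conv_block(64, 2)    # 224x224 -> 112x112
    + conv_block(128, 2)   # 112x112 -> 56x56
    + conv_block(256, 3)   # 56x56  -> 28x28
    + conv_block(512, 3)   # 28x28  -> 14x14
    + conv_block(512, 3)   # 14x14  -> 7x7, filters stay at 512
)
vgg16_features.summary()
```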

Rajesh Mappu
    Go through the following blog to get a deep understanding of VGG-16 architecture: The Architecture and Implementation of VGG-16 (https://medium.com/towards-artificial-intelligence/the-architecture-and-implementation-of-vgg-16-b050e5a5920b) – Vaibhav Khandelwal Dec 03 '20 at 16:24

1 Answer


Good questions. Let me reply one by one.

1- The number of filters can be increased; there is no hard limit. However, consider two cases:

  • The DNN part. If the filter count were doubled to 1024, the output shape would be 1024 x 7 x 7, and mapping it to 4096 features would need about 205M parameters in the dense_1 layer alone (see the calculation after this list). This change would cause two possible problems: overfitting and slower inference/training.
  • Sparsity. You could use 1024 filters in the 5th conv block, train the network, and check the accuracy to see whether the increased feature count makes any impact.
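
To make the first bullet concrete, here is the back-of-the-envelope parameter count in plain Python (the `dense_params` helper is just for illustration):

```python
# Parameters of a dense layer = (flattened inputs + 1 bias per unit) * units.
def dense_params(channels, height, width, units):
    return (channels * height * width + 1) * units

print(dense_params(512, 7, 7, 4096))   # 102,764,544 -> actual VGG-16 fc1
print(dense_params(1024, 7, 7, 4096))  # 205,524,992 -> hypothetical 1024 filters
```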

So, to decide the number of conv filters, it is good practice to check the sparsity of the feature layers, for example as sketched below.
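
As one illustration of such a sparsity check, you could measure the fraction of zero (post-ReLU) activations in a feature layer. This sketch uses the Keras VGG-16 with random weights and random input, so the printed number is only a placeholder for what you would measure on a trained network with real images:

```python
import numpy as np
import tensorflow as tf

# Probe the activations of VGG-16's last conv layer on a random batch.
model = tf.keras.applications.VGG16(weights=None)
probe = tf.keras.Model(model.input, model.get_layer("block5_conv3").output)
acts = probe.predict(np.random.rand(4, 224, 224, 3).astype("float32"))
print("fraction of zero activations:", float(np.mean(acts == 0.0)))
```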

2- If I did not misunderstand your question, those layers are the DNN classifier, i.e. fully-connected layers.
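
For concreteness, a minimal sketch of that fully-connected head as it sits on top of the 7x7x512 feature maps (dropout omitted; the comments give the Keras VGG-16 layer names):

```python
from tensorflow.keras import layers, Sequential

classifier = Sequential([
    layers.Input((7, 7, 512)),
    layers.Flatten(),                          # 7*7*512 = 25088 values
    layers.Dense(4096, activation="relu"),     # "fc1"
    layers.Dense(4096, activation="relu"),     # "fc2"
    layers.Dense(1000, activation="softmax"),  # "predictions"
])
```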

3- Max pooling does not cause us to see only a subset of the image; it is just a resizing algorithm. What you see in that video is the filter's response to objects that exist in the image. At deeper levels, do not expect visual shapes like those in the shallow layers. You may even get a single white pixel at the deepest layer, and it may be the response to an object like a human face.
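
One way to see why a deep unit can respond to something as large as a face is to compute its theoretical receptive field with the standard recurrence r_out = r_in + (k - 1) * j_in, j_out = j_in * s (k = kernel size, s = stride, j = cumulative stride):

```python
# (kernel, stride) for every layer of VGG-16's feature extractor.
vgg16 = (
    [(3, 1)] * 2 + [(2, 2)]      # block 1: two 3x3 convs + 2x2 pool
    + [(3, 1)] * 2 + [(2, 2)]    # block 2
    + [(3, 1)] * 3 + [(2, 2)]    # block 3
    + [(3, 1)] * 3 + [(2, 2)]    # block 4
    + [(3, 1)] * 3 + [(2, 2)]    # block 5
)
r, j = 1, 1
for k, s in vgg16:
    r, j = r + (k - 1) * j, j * s
print(r)  # 212: each unit of the 7x7 map "sees" a 212x212 input patch
```

So even though the final map is only 7x7, each of its units aggregates evidence from almost the entire 224x224 input, which is why a face-sized pattern can still act as a feature there.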

Deniz Beker
  • Hi, thank you. I now understand the point about the number of filters, and I can visualize the max-pooling results better as well. But about point 2: conv2d_8 and the 9th layer have the same hyper-parameters, and those layers don't act as a classifier, right? Conv2D helps produce better activation values for edge detection; I was just wondering how they arrived at two conv2d layers with the same hyper-parameters at the later stages, rather than just one such layer. Was it experimentation with changes to the VGG architecture that led to that choice? Also, is the flatten layer the one we call the fully connected layer? – Rajesh Mappu Dec 13 '17 at 18:42