I'm going to assume that what you posted is the output of something like model.summary() from TensorFlow/Keras. Under that assumption, (None, 3, 3, 64) is the output shape of the layer. We can ignore the None: it is a placeholder for the batch size, which isn't fixed until runtime.
Thus, the output of the last layer can be seen as a 3x3 image with 64 channels. Alternatively, you can think of it as 64 3x3 images. For more information about shapes, see this question: Keras input explanation: input_shape, units, batch_size, dim, etc
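As a quick sketch of that "64 3x3 images" view, here is what the reshaping looks like in NumPy (the array contents are random placeholders; only the shapes matter):

```python
import numpy as np

# Stand-in for one sample's feature map with layout (height, width, channels),
# matching a (None, 3, 3, 64) layer output once the batch dimension is dropped.
features = np.random.rand(3, 3, 64)

# View the same data as 64 separate 3x3 "images" by moving channels first.
as_images = np.transpose(features, (2, 0, 1))
print(as_images.shape)  # (64, 3, 3)
```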
So what is (probably) happening is that, given a large(r) input image, the neural network extracts relevant features which describe the validation data set very well.
Answering the question of how this output is enough to describe the validation data set so well is probably a bit harder. If you'd like, you can chalk it up to neural network "magic". But keep in mind that the MNIST dataset isn't really that "hard" - at the end of the day, it is a database of handwritten digits from 0 to 9. So it isn't that surprising that 3 * 3 * 64 = 576 numbers are capable of describing the data set to a degree that allows high accuracy. In fact, outputting just 10 numbers would be enough, if each number encoded how likely it is that the digit in the image is a 0, 1, 2, etc.
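To make the arithmetic concrete: this is roughly what the final classification head does with those 576 numbers. The weights below are random stand-ins (in a real model they'd be learned by a Dense layer), so the probabilities are meaningless - the point is just the shape reduction from 576 features to 10 class probabilities:

```python
import numpy as np

flat = np.random.rand(3 * 3 * 64)      # the 576 extracted features, flattened
weights = np.random.rand(576, 10)      # stand-in for a trained dense layer

logits = flat @ weights                # one score per digit class
probs = np.exp(logits - logits.max())  # softmax (shifted for numerical stability)
probs /= probs.sum()

print(probs.shape)  # (10,) - one probability per digit, summing to 1
```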
Nonetheless, you might want to take a look at what those 3x3 images look like. In that case, How to Visualize Filters and Feature Maps in Convolutional Neural Networks might be of interest to you.