I'm going to assume that what you posted is the output of something like model.summary() from TensorFlow/Keras. Under that assumption, (None, 3, 3, 64) is the output shape of the layer. We can ignore the None: it is a placeholder for the batch size, which isn't fixed until runtime.
Thus, the output of the last layer can be seen as a 3x3 image with 64 channels. Alternatively, you can think of it as 64 3x3 images. For more information about shapes, see this question: Keras input explanation: input_shape, units, batch_size, dim, etc
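As a quick sketch of that "64 3x3 images" view, here is what the reshaping looks like in NumPy (the array contents are random placeholders; only the shapes matter):

```python
import numpy as np

# Stand-in for one sample's feature map with layout (height, width, channels),
# matching a (None, 3, 3, 64) layer output once the batch dimension is dropped.
features = np.random.rand(3, 3, 64)

# View the same data as 64 separate 3x3 "images" by moving channels first.
as_images = np.transpose(features, (2, 0, 1))
print(as_images.shape)  # (64, 3, 3)
```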
So what is (probably) happening is that, given a large(r) input image, the neural network extracts relevant features which describe the validation data set very well.
Answering the question of how this output is enough to describe the validation data set so well is probably a bit harder. If you'd like, you can chalk it up to neural network "magic". But keep in mind that the MNIST dataset isn't really that "hard" - at the end of the day, it is a database of handwritten digits from 0 to 9. So it isn't that surprising that 3 * 3 * 64 = 576 numbers are capable of describing the data set to a degree that allows high accuracy. In fact, outputting just 10 numbers would be enough, if each number encoded how likely it is that the digit in the image is a 0, 1, 2, etc.
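To make the arithmetic concrete: this is roughly what the final classification head does with those 576 numbers. The weights below are random stand-ins (in a real model they'd be learned by a Dense layer), so the probabilities are meaningless - the point is just the shape reduction from 576 features to 10 class probabilities:

```python
import numpy as np

flat = np.random.rand(3 * 3 * 64)      # the 576 extracted features, flattened
weights = np.random.rand(576, 10)      # stand-in for a trained dense layer

logits = flat @ weights                # one score per digit class
probs = np.exp(logits - logits.max())  # softmax (shifted for numerical stability)
probs /= probs.sum()

print(probs.shape)  # (10,) - one probability per digit, summing to 1
```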
Nonetheless, you might want to take a look at what those 3x3 images look like. In that case, How to Visualize Filters and Feature Maps in Convolutional Neural Networks might be of interest to you.