This post refers to Fig. 1 of the Deep Convolutional Inverse Graphics Network (DC-IGN) paper (Kulkarni et al., NIPS 2015):
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/kwkt_nips2015.pdf
Having read the paper, I understand in general terms how the network functions. However, one detail has been bothering me: how does the decoder (or "Renderer") generate small-scale features in the correct location, as specified by the graphics code? For example, when training the network on a dataset of faces, one might train a single parameter in the graphics code to control the (x, y) location of a small freckle. Since this feature is small, it will be "rendered" by the last convolutional layer, whose kernels cover only a small patch of the image. What I don't understand is how the freckle's location, encoded in the graphics code, propagates through to that last layer when there are several larger-scale unpooling + convolution layers in between.
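To make the question concrete, here is a rough PyTorch sketch of the decoder structure I have in mind from Fig. 1: a low-dimensional graphics code is projected to a small feature map and then expanded by alternating unpooling (upsampling) and convolution stages. The layer sizes, filter counts, and the `Decoder` class itself are illustrative assumptions on my part, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Illustrative DC-IGN-style "renderer": code -> unpool+conv stages -> image.
    All dimensions below are made up for the sake of the example."""
    def __init__(self, code_dim=200):
        super().__init__()
        # Project the graphics code onto a small spatial feature map
        self.fc = nn.Linear(code_dim, 128 * 9 * 9)
        self.render = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),   # "unpooling"
            nn.Conv2d(128, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(64, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(32, 1, kernel_size=5, padding=2),    # last layer draws fine detail
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 128, 9, 9)
        return self.render(x)

# A single unit of z might encode the (x, y) position of a tiny feature such as
# a freckle; my question is how that positional information survives the trip
# from the dense code through the coarse unpool+conv stages to the final layer.
z = torch.randn(1, 200)
img = Decoder()(z)
print(img.shape)  # (1, 1, 72, 72) with these illustrative sizes
```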
Thanks for the help!