I was recently reading Hinton's GLOM paper, "How to represent part-whole hierarchies in a neural network", and I am unsure what exactly he means by parsing images into "part-whole hierarchies".
Moreover, doesn't semantic segmentation already "parse" an image into its parts? If so, what is different about a part-whole hierarchy?