In computer vision it is very common to use supervised tasks, where datasets have to be manually annotated by humans. Some examples are object classification (class labels), detection (bounding boxes) and segmentation (pixel-level masks). These datasets are essentially collections of input-output pairs, which are used to train Convolutional Neural Networks to learn the mapping from inputs to outputs via gradient descent optimization. But animals don't need anyone to draw bounding boxes or masks on top of things for them to learn to detect objects and make sense of the visual world around them. This leads me to think that brains must be performing some sort of self-supervision to train themselves to see.
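To make "self-supervision" concrete, here is a minimal sketch of one well-known pretext task from the computer vision literature, rotation prediction: the network is trained to guess by how much an image was rotated, so the "ground truth" is generated automatically from unlabeled images and no human annotation is needed. The small CNN and hyperparameters below are placeholders I made up for illustration, not a reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical small CNN backbone; any image model would do here.
class SmallCNN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        h = self.features(x).mean(dim=(2, 3))  # global average pooling
        return self.head(h)

def rotate_batch(images):
    """Build (rotated image, rotation index) pairs from unlabeled images.
    The label (0, 90, 180 or 270 degrees) is generated automatically."""
    rotated, labels = [], []
    for img in images:
        k = torch.randint(0, 4, (1,)).item()          # pick a random rotation
        rotated.append(torch.rot90(img, k, dims=(1, 2)))
        labels.append(k)
    return torch.stack(rotated), torch.tensor(labels)

model = SmallCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random tensors standing in for a real loader of unlabeled images.
unlabeled_images = torch.rand(8, 3, 32, 32)

inputs, targets = rotate_batch(unlabeled_images)
logits = model(inputs)
loss = F.cross_entropy(logits, targets)  # loss computed without any human labels
loss.backward()
optimizer.step()
print(loss.item())
```

Pretext tasks like this (rotation prediction, colorization, predicting the next frame of a video, and so on) derive the supervisory signal from the data itself, which seems like a rough machine analogue of what I'm asking about in brains.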
What does current research say about the learning paradigm used by brains to achieve such an outstanding level of visual competence? Which tasks do brains use to train themselves to become so good at processing visual information and making sense of the visual world? In other words: how does the brain manage to train its neural networks without access to manually annotated datasets like ImageNet or COCO (i.e., what does the brain use as ground truth, and what loss function is it optimizing)? Finally, can we apply these insights to computer vision?
Update: I posted a related question on Psychology & Neuroscience StackExchange, which I think complements the question I posted here: check it out.