
I am reading the RetinaFace paper:

*RetinaFace: Single-stage Dense Face Localisation in the Wild*, by Jiankang Deng, Jia Guo, Yuxiang Zhou, Jinke Yu, Irene Kotsia, Stefanos Zafeiriou

link: https://arxiv.org/abs/1905.00641

One of the questions we aim at answering in this paper is whether we can push forward the current best performance (90.3% [67]) on the WIDER FACE hard test set [60] by using extra supervision signal built of five facial landmarks.


Question: what is meant here by extra-supervision and self-supervision?

Also, could you suggest some good resources for understanding this paper better?

mat
  • Without reading the paper, extra supervision probably only means that you use other supervisory signals, in addition to the existing ones, like the picture suggests. Now, the more interesting questions are - 1. what task(s) are we actually trying to solve here? 2. How are these different signals combined to train the model to solve those tasks? I'd recommend that you start by reading the answers in [this post](https://ai.stackexchange.com/q/10623/2444) in order to get familiar with SSL, if you haven't already. – nbro Jul 05 '23 at 07:59
  • @nbro, thanks for the suggestion on SSL; it would be great to know how extra supervision is used in these tasks – mat Jul 06 '23 at 03:01

1 Answer


Basically:

  • The extra-supervision refers to a facial-landmark regression loss, $L_\text{pts}$ (Section 3.1): one branch of the network also predicts five facial landmarks (the centres of the eyes, the nose, and the corners of the mouth), and a simple regression loss is applied to each point. Intuitively, the landmarks add information about the face pose, so predicting this extra information may allow the network to attain better performance.
  • The self-supervision part is not a typical SSL loss. Instead, there is an additional dense regression branch (Section 3.2) that uses computer-vision techniques to produce a 3D mesh of the input face image, giving an additional pixel-wise loss ($L_\text{pixel}$) on the 2D projections of these 3D face meshes. You can still consider it "self-supervision" in the sense that no extra labels are required (so no additional supervision): the 3D meshes are extracted from the (unlabelled) images that are input to the network.
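The landmark branch above is, at its core, a point-wise regression. A minimal sketch of that idea, assuming a smooth-L1 (Huber-style) loss over five predicted (x, y) landmarks versus ground truth (the paper's exact normalisation and anchor matching are omitted here, and all names are illustrative):

```python
def smooth_l1(diff, beta=1.0):
    """Smooth-L1 on a single coordinate difference: quadratic near zero,
    linear for large errors (less sensitive to outliers than plain L2)."""
    d = abs(diff)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def landmark_loss(pred, target):
    """pred/target: lists of five (x, y) tuples; returns the summed
    smooth-L1 loss over all ten coordinates."""
    total = 0.0
    for (px, py), (tx, ty) in zip(pred, target):
        total += smooth_l1(px - tx) + smooth_l1(py - ty)
    return total
```

For example, a prediction that is off by 0.5 in x on every landmark incurs a loss of 5 × 0.125 = 0.625, while a perfect prediction gives 0.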

Lastly, all the losses are balanced with coefficients $\lambda = [0.25, 0.1, 0.01]$, as shown in Equation 1.
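The way these coefficients combine the branch losses can be sketched as follows. This is only an illustration of the weighted sum in Equation 1, assuming the four per-branch losses (classification, box regression, landmark regression, dense/pixel-wise regression) have already been computed as scalars; the function name is mine:

```python
def combined_loss(l_cls, l_box, l_pts, l_pixel,
                  lambdas=(0.25, 0.1, 0.01)):
    """Weighted multi-task loss in the style of the paper's Equation 1:
    classification loss plus the three regression losses, each scaled
    by its balancing coefficient."""
    lam1, lam2, lam3 = lambdas
    return l_cls + lam1 * l_box + lam2 * l_pts + lam3 * l_pixel
```

With all branch losses equal to 1.0, the total is 1 + 0.25 + 0.1 + 0.01 = 1.36, which makes the relative weighting explicit: box regression matters most among the auxiliary terms, and the dense (self-supervised) term is weighted lightest.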

Luca Anzalone
  • So extra-supervision is about information learned from one supervised task being useful for other tasks. But I'm still confused about SSL here, as it looks more like unsupervised learning. Could you suggest any good resource for building intuition about 3D mesh decoders? As a beginner, I find it hard to grasp – mat Jul 06 '23 at 03:19
  • @mat In this context, extra-supervision and SSL are unrelated. With extra-supervision you annotate your data with additional labels (e.g. the five landmarks); with SSL you create supervision on the fly from your own input, thus without extra annotations. To understand 3D computer vision you need to learn about computer graphics and CV itself, by taking some courses and/or reading books. Anyway, you can start by following the references provided in Section 3.2 – Luca Anzalone Jul 06 '23 at 09:55
  • Thank you for your valuable answer – mat Jul 07 '23 at 01:40