Short answer: yes, the teacher model should always train with noise.
The noise here is just the augmentation and regularization you may already be familiar with; the authors used three kinds of noise (see the sketch after this list):
- Data augmentation
- Dropout
- Stochastic depth (dropout applied at the level of residual blocks).
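As a minimal sketch of what injecting these three kinds of noise might look like in PyTorch (the `StochasticDepthBlock`, the drop rates, and the transform pipeline are illustrative assumptions, not the paper's exact setup):

```python
import torch
import torch.nn as nn
from torchvision import transforms

# 1) Data augmentation: random transforms applied to the inputs
#    (a simple illustrative pipeline, not the paper's exact augmentation).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

class StochasticDepthBlock(nn.Module):
    """Residual block that is randomly skipped during training
    ("dropout for residual blocks")."""
    def __init__(self, block: nn.Module, survival_prob: float = 0.8):
        super().__init__()
        self.block = block
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training and torch.rand(1).item() > self.survival_prob:
            return x                        # skip the residual branch entirely
        out = self.block(x)
        if self.training:
            out = out / self.survival_prob  # rescale so test-time needs no change
        return x + out

# 2) Dropout and 3) stochastic depth live inside the model itself.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(),
    StochasticDepthBlock(nn.Conv2d(64, 64, 3, padding=1)),
    nn.Dropout(p=0.5),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 1000),
)
```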
In step 2, the authors keep only the samples the teacher scores with high confidence:

> Specifically, we filter images that the teacher model has low confidences on since they are usually out-of-domain images.
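A sketch of this filtering step, assuming the teacher is a PyTorch classifier and the loader yields batches of image tensors (the 0.3 threshold is an illustrative value, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def filter_by_confidence(teacher, unlabeled_loader, threshold=0.3):
    """Keep only images the teacher labels with high confidence."""
    teacher.eval()  # noise is off when the teacher predicts labels
    kept_images, kept_labels, kept_conf = [], [], []
    for images in unlabeled_loader:
        probs = F.softmax(teacher(images), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = conf >= threshold   # low-confidence images are likely out-of-domain
        kept_images.append(images[mask])
        kept_labels.append(pseudo[mask])
        kept_conf.append(conf[mask])
    return torch.cat(kept_images), torch.cat(kept_labels), torch.cat(kept_conf)
```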
They also add or remove samples to balance the class distribution of the pseudo-labeled dataset:

> we duplicate images in classes where there are not enough images. For classes where we have too many images, we take the images with the highest confidence.
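A sketch of that balancing step; the per-class target `k` is a hypothetical parameter, not a value from the paper:

```python
import torch

def balance_classes(images, labels, confidences, k):
    """Roughly balance a pseudo-labeled dataset to k images per class:
    duplicate under-represented classes, keep only the k most confident
    images in over-represented ones."""
    out_images, out_labels = [], []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        if len(idx) >= k:
            # too many images: keep the k most confident ones
            top = confidences[idx].topk(k).indices
            idx = idx[top]
        else:
            # too few images: duplicate until we reach k
            repeats = (k + len(idx) - 1) // len(idx)
            idx = idx.repeat(repeats)[:k]
        out_images.append(images[idx])
        out_labels.append(labels[idx])
    return torch.cat(out_images), torch.cat(out_labels)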
And most importantly, they use a bigger model (more capacity, more parameters) as the student:

> we want the student to be better than the teacher by giving the student model enough capacity
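Putting the pieces together, one self-training round might look like the sketch below, reusing the two helpers sketched above; `train_fn` and `per_class_target` are placeholders for the caller's own training loop and balancing target, not values from the paper:

```python
def noisy_student_round(teacher, student, train_fn,
                        labeled_data, unlabeled_loader, per_class_target=1000):
    """One round: the (noiseless) teacher pseudo-labels data, then a
    larger, noised student learns from labeled + pseudo-labeled data."""
    images, labels, conf = filter_by_confidence(teacher, unlabeled_loader)
    images, labels = balance_classes(images, labels, conf, per_class_target)
    train_fn(student, labeled_data, (images, labels))  # student trains WITH noise
    return student  # the trained student becomes the next round's teacher
```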
Now, to keep it simple: these noise methods make any model stronger, so the teacher, the student, or any other model should apply them during the training phase and turn them off at prediction time (e.g., when the teacher predicts pseudo-labels).
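In PyTorch, for example, this train-time/test-time split is exactly what `model.train()` and `model.eval()` control for dropout (and for the custom stochastic-depth block sketched earlier):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5), nn.Linear(10, 5))
x = torch.randn(4, 10)

model.train()          # noise ON: dropout is active while training
noisy_out = model(x)

model.eval()           # noise OFF: e.g., when the teacher predicts pseudo-labels
with torch.no_grad():
    clean_out = model(x)
```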
In my opinion, this method is a lot like label smoothing, but with the fixed smoothing hyper-parameter replaced by a learned distribution (the teacher's soft pseudo-labels).
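To see the analogy, compare a label-smoothed target with a teacher's soft pseudo-label: both spread probability mass off the one-hot class, but the teacher's distribution is learned and input-dependent rather than fixed (all values below are illustrative):

```python
import torch
import torch.nn.functional as F

num_classes, true_class, eps = 5, 2, 0.1

# Label smoothing: a FIXED hyper-parameter eps spreads mass uniformly.
smoothed = torch.full((num_classes,), eps / num_classes)
smoothed[true_class] += 1.0 - eps
print(smoothed)        # tensor([0.0200, 0.0200, 0.9200, 0.0200, 0.0200])

# Noisy Student: the teacher's softmax output plays the same role,
# but the off-class mass is LEARNED, not a constant.
teacher_logits = torch.tensor([0.1, 0.3, 3.0, 0.2, -0.5])  # illustrative logits
soft_pseudo_label = F.softmax(teacher_logits, dim=0)
print(soft_pseudo_label)
```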