
Reading through the paper on the Noisy Student algorithm, I have a quick question about how the initial teacher model is built.

In step 1 of the algorithm, the loss function is defined such that it looks like the initial teacher model is trained with noise. But then, in step 2, it says the teacher model used to generate labels for the unlabeled data should not be noised.

So, should you be adding noise or not for the first teacher model that you train?


1 Answer


Short answer: yes, the initial teacher model is trained with noise; the noise is only turned off when the teacher predicts labels for the unlabeled images (step 2).

The noise here is just the augmentation and regularization you are probably already familiar with; the authors use three kinds of noise (a minimal training sketch follows the list):

  • Data augmentation (input noise)
  • Dropout (model noise)
  • Stochastic depth (dropout applied to whole residual blocks)
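Here is a minimal sketch (my own illustration, not the paper's code) of how both kinds of noise enter the teacher's training step. `augment` is an assumed callable standing in for the data-augmentation pipeline (the paper uses RandAugment), and stochastic depth is omitted to keep the toy model small:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the teacher; the paper uses EfficientNet, but any
# network with dropout shows where "model noise" enters training.
teacher = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),          # model noise: dropout
    nn.Linear(256, 10),
)
optimizer = torch.optim.SGD(teacher.parameters(), lr=0.1)

def train_step(model, images, labels, augment):
    model.train()               # noise ON: dropout is active
    noisy = augment(images)     # input noise: data augmentation (assumed callable)
    loss = F.cross_entropy(model(noisy), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```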

In step 2, the authors keep only the samples on which the (un-noised) teacher makes high-confidence predictions:

Specifically, we filter images that the teacher model has low confidences on since they are usually out-of-domain images.
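In code, that filtering step might look like the following sketch; the 0.3 threshold is my own placeholder, not a value taken from the quote:

```python
import torch

@torch.no_grad()
def pseudo_label(teacher, unlabeled_images, threshold=0.3):
    # threshold is an assumed hyper-parameter for illustration
    teacher.eval()                      # noise OFF: no dropout at inference
    probs = torch.softmax(teacher(unlabeled_images), dim=1)
    confidence, labels = probs.max(dim=1)
    keep = confidence >= threshold      # drop low-confidence, likely out-of-domain images
    return unlabeled_images[keep], labels[keep], confidence[keep]
```

Note that `teacher.eval()` is exactly where "the teacher should not be noised" shows up: dropout and stochastic depth are disabled while generating pseudo labels.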

They also duplicate or drop samples so that the class distribution of the pseudo-labeled dataset stays balanced:

we duplicate images in classes where there are not enough images. For classes where we have too many images, we take the images with the highest confidence.
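A hedged sketch of that balancing step, where `per_class` (the target number of images per class) is an assumed parameter:

```python
import torch

def balance_classes(images, labels, confidence, per_class):
    out_imgs, out_lbls = [], []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        if len(idx) >= per_class:
            # too many images: keep only the most confident ones
            order = confidence[idx].argsort(descending=True)
            idx = idx[order[:per_class]]
        else:
            # not enough images: duplicate up to the target count
            repeats = -(-per_class // len(idx))  # ceil division
            idx = idx.repeat(repeats)[:per_class]
        out_imgs.append(images[idx])
        out_lbls.append(labels[idx])
    return torch.cat(out_imgs), torch.cat(out_lbls)
```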

Most importantly, they use an equal-or-larger model (more capacity, more parameters) as the student:

we want the student to be better than the teacher by giving the student model enough capacity

Now, thinking about it simply: these noise methods make the trained model stronger, so any model in the loop, teacher or student, should apply them during its own training phase and remove them at prediction time. Generating pseudo labels is a prediction, which is why the teacher is un-noised in step 2. The sketch below ties the whole loop together.
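Here is that outer loop as a rough sketch, reusing `pseudo_label` and `balance_classes` from above; `make_student` and `train_with_noise` are hypothetical helpers (building an equal-or-larger model, and running a noised training loop like `train_step`), not code from the paper:

```python
def noisy_student(teacher, labeled_data, unlabeled_images,
                  make_student, train_with_noise, iterations=3):
    for i in range(iterations):
        # Teacher predicts WITHOUT noise (pseudo_label calls teacher.eval()).
        imgs, lbls, conf = pseudo_label(teacher, unlabeled_images)
        imgs, lbls = balance_classes(imgs, lbls, conf, per_class=1000)  # illustrative target
        # Student is equal or larger and trains WITH noise on the
        # labeled data plus the pseudo-labeled data.
        student = make_student(iteration=i)        # hypothetical factory
        train_with_noise(student, labeled_data, (imgs, lbls))
        # The trained student becomes the teacher for the next round.
        teacher = student
    return teacher
```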

In my opinion, this method works a lot like label smoothing, except that the fixed smoothing hyper-parameter is replaced by something learned: the teacher's predicted distribution.
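To make that comparison concrete (my own illustration, not from the paper):

```python
import torch
import torch.nn.functional as F

num_classes, eps = 5, 0.1
hard = F.one_hot(torch.tensor(2), num_classes).float()

# Label smoothing: a FIXED hyper-parameter eps decides how mass is spread.
smoothed = (1 - eps) * hard + eps / num_classes
# -> tensor([0.0200, 0.0200, 0.9200, 0.0200, 0.0200])

# Noisy Student with soft pseudo labels: the LEARNED teacher decides.
teacher_logits = torch.tensor([0.1, 0.3, 2.5, 0.2, 0.4])  # made-up logits
soft_target = torch.softmax(teacher_logits, dim=0)
```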

CuCaRot