
I aim to do action recognition in videos on a private dataset.

To compare with existing state-of-the-art implementations, other authors have published their code on GitHub, like the one here (for the paper Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework). There, the author first trains the embedding network (a 3D ResNet without the final classification layer) with contrastive learning. Then he adds a final classification layer and fine-tunes the weights, training the whole network again for some epochs.
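For concreteness, the two-stage setup could be sketched in PyTorch roughly as follows; the backbone (torchvision's r3d_18) and all class names here are illustrative placeholders, not the repository's actual code:

```python
import torch.nn as nn
import torchvision

class EmbeddingNet(nn.Module):
    """3D ResNet backbone with the final classification layer removed."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.video.r3d_18(weights=None)  # placeholder backbone choice
        self.feature_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()            # drop the classification layer
        self.backbone = backbone

    def forward(self, clips):                  # clips: (B, C, T, H, W)
        return self.backbone(clips)            # (B, feature_dim) embeddings

class FinetuneNet(nn.Module):
    """Embedding network plus a freshly added classification layer for fine-tuning."""
    def __init__(self, embedding_net, num_classes):
        super().__init__()
        self.embedding_net = embedding_net
        self.classifier = nn.Linear(embedding_net.feature_dim, num_classes)

    def forward(self, clips):
        return self.classifier(self.embedding_net(clips))
```

EmbeddingNet would be trained with the contrastive objective first, and FinetuneNet would then be trained with a standard classification loss for some epochs.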

Now, here is my doubt: is there a way to estimate the test accuracy while training just the embedding network?

One way to tell whether the final accuracy after fine-tuning will be good is to check whether the training loss is decreasing. If the training loss decreases, that certainly builds up the hope that the test accuracy is improving during training, but it gives no idea of how high the test accuracy would actually be.

Another way is to plot a t-SNE of the test embeddings and check whether data points from the same class are close together, forming clusters. If so, it could be argued that the test accuracy would also be good. But this is not quantifiable, so it would be hard to compare the t-SNE plots obtained from two different models.
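As a reference for what I mean, this check could be done with scikit-learn roughly like this (test_embeddings and test_labels are placeholder names for whatever the frozen embedding network produces on the test set):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(test_embeddings: np.ndarray, test_labels: np.ndarray) -> None:
    # Project the (N, D) embeddings down to 2D for visual inspection.
    points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(test_embeddings)
    plt.figure(figsize=(6, 6))
    plt.scatter(points[:, 0], points[:, 1], c=test_labels, cmap="tab10", s=5)
    plt.colorbar(label="class index")
    plt.title("t-SNE of test-set embeddings")
    plt.show()
```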

I was also suggested to add a final layer to my embedding network and just test it on the test data, without training or fine-tuning again. The reasoning is that the embedding network should have learned reasonable weights by now, so even if I fine-tuned the model, the test accuracy would not vary a lot. I need some advice here. Is that suggestion good? Are there any potential pitfalls with it?
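To make that suggestion concrete, I read it as a linear probe: only the added final layer is fit on frozen embeddings, and the backbone itself is never trained or fine-tuned again. A minimal sketch with scikit-learn (the embedding and label arrays are hypothetical outputs of the frozen network on the train and test splits):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe_accuracy(train_embeddings: np.ndarray, train_labels: np.ndarray,
                          test_embeddings: np.ndarray, test_labels: np.ndarray) -> float:
    # Only this linear classifier (the "final layer") is fit; the embedding
    # network that produced the features stays frozen throughout.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_embeddings, train_labels)
    return accuracy_score(test_labels, probe.predict(test_embeddings))
```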

Or do you have any other possible suggestions I could try?

