
My training of a ResNet-18 network on ImageNet using a Tesla V100 seems to be quite slow (one epoch takes about 2.5 hours with a batch size of 128). Increasing the number of GPUs does not seem to help.

What are your training times for ResNet-18/ResNet-50 on ImageNet? How many epochs do you train for to reach the desired accuracy? I am wondering what I should expect.

  • Do you double the batch size after increasing the number of GPUs? It's strange that the training speed stays constant. – CuCaRot May 02 '22 at 15:58

1 Answer


Currently, using the code from this branch: https://github.com/benchopt/benchmark_resnet_classif/pull/53, it takes me 35 minutes per epoch to train a ResNet-18 on ImageNet in TensorFlow with a V100 GPU, a batch size of 128, and standard data augmentations.

I haven't found other mentions of the training times for a ResNet-18 with a standard training policy, so I am just mentioning this to kick off the conversation without claiming that this is the best one can get.
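For reference, here is a minimal PyTorch sketch of a comparable single-GPU timing setup (ResNet-18, batch size 128, standard ImageNet augmentations). The dataset path, loader settings, and optimizer hyperparameters are placeholder assumptions; this is not the benchmark code from the PR above, which runs in TensorFlow.

```python
import time

import torch
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from torchvision.models import resnet18

# Standard ImageNet training augmentations: random resized crop + horizontal flip.
train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# "/path/to/imagenet/train" is a placeholder for the usual ImageFolder layout.
train_set = ImageFolder("/path/to/imagenet/train", transform=train_transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True,
                          num_workers=8, pin_memory=True)

device = torch.device("cuda")
model = resnet18(num_classes=1000).to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# Time one full epoch to compare against the numbers quoted above.
model.train()
start = time.perf_counter()
for images, targets in train_loader:
    images = images.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
torch.cuda.synchronize()
print(f"epoch time: {(time.perf_counter() - start) / 60:.1f} min")
```

With a setup like this, the data loading (number of workers, JPEG decoding) is often the bottleneck rather than the GPU itself, which is one reason per-epoch times vary so much between reports.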

EDIT

With timm it is possible on a single V100 GPU to reach 0.1 s per iteration with a batch size of 128 (see this discussion I had with Ross Wightman). Since ImageNet has about 1.28M training images, that is roughly 10,000 iterations per epoch, i.e. about a 16-minute epoch. I think this is a very good baseline, and you can make it even faster with AMP, a larger batch size, and of course distributed training. My implementation here, in PyTorch, achieves 22 minutes per epoch; I might be missing some optimizations here and there.
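To illustrate the AMP point, here is a minimal sketch of a mixed-precision training step with a timm ResNet-18, using the standard torch.cuda.amp API. The hyperparameters are placeholders and this is not the exact code from the implementation linked above.

```python
import timm
import torch

device = torch.device("cuda")
# timm's ResNet-18; a torchvision ResNet would work the same way.
model = timm.create_model("resnet18", num_classes=1000).to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# GradScaler rescales the loss to avoid fp16 gradient underflow.
scaler = torch.cuda.amp.GradScaler()

def train_step(images, targets):
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass and loss in mixed precision.
    with torch.cuda.amp.autocast():
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

You would call train_step inside the usual loop over the DataLoader; on Volta GPUs the autocast region lets matrix multiplications and convolutions use the tensor cores, which is where most of the speed-up over the fp32 numbers above comes from.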