
I was prompted to ask this question while trying to find server racks and motherboards specialized for artificial intelligence. Naturally I went to the SuperMicro website. There, the chassis+motherboard combination in the "artificial intelligence" category that supported the most GPUs could take up to 8 of them. Additionally, the Nvidia DGX-1 also has only 8 Tesla P100 GPUs. And just to rub it in, Matlab does not support more than 8 GPUs, last I checked.

So, are more than 8 GPUs practical for deep learning? I would take Caffe, CNTK, Tensorflow and Torch7 as references.

Blaszard
Rushat Rai

1 Answer


I did some recent research on this topic. It all comes down to parallelization.
Basically there are 2 ways to do it: model parallelization or batch parallelization.

Model parallelization is when you split the model by layers among multiple GPUs. To the best of my knowledge you can't split a single layer between GPUs, so 8 GPUs would serve 8 layers, which is already quite a lot. Tensorflow supports this method. In my opinion more than 6 doesn't make sense this way.
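For illustration, here is a minimal sketch of layer-wise model parallelism in Tensorflow, assuming three GPUs are available; the three-layer model, device names and layer sizes are made up for the example and are not from the question:

```python
import tensorflow as tf

# Hypothetical three-layer model; each layer's compute is pinned to its own GPU.
layer1 = tf.keras.layers.Dense(1024, activation='relu')
layer2 = tf.keras.layers.Dense(1024, activation='relu')
layer3 = tf.keras.layers.Dense(10)

def forward(x):
    # Activations are copied from GPU to GPU between layers, so every extra
    # device boundary adds a transfer on top of the layer's own compute.
    with tf.device('/GPU:0'):
        h = layer1(x)
    with tf.device('/GPU:1'):
        h = layer2(h)
    with tf.device('/GPU:2'):
        return layer3(h)
```

With one layer per device, adding GPUs beyond the depth of the network buys you nothing, which is why the count of useful GPUs is capped this way.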

Batch parallelization is when you run the entire model on each GPU in parallel and split your batch across them. The usual trick is to define a larger batch that is split so that each GPU ends up with the desired batch size. In this case batch splitting and updating the weights are done on the CPU (in the case of Tensorflow), and after 3 GPUs any additional GPU brings only a marginal improvement in training speed (as per reports). So here even 4 doesn't make much sense and 8 is just crazy. Here is an example of batch parallelization.
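A rough sketch of that pattern, assuming two GPUs and a toy model; the model, optimizer and device list are placeholders and not taken from the linked example:

```python
import tensorflow as tf

# Hypothetical 2-GPU data-parallel training step: split an oversized batch,
# compute gradients per GPU, then average and apply them on the CPU.
gpus = ['/GPU:0', '/GPU:1']
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.SGD(0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_step(x, y):
    # Split the oversized batch so each GPU gets the desired per-GPU batch
    # (assumes the batch size is divisible by the number of GPUs).
    x_shards = tf.split(x, len(gpus))
    y_shards = tf.split(y, len(gpus))
    tower_grads = []
    for dev, xs, ys in zip(gpus, x_shards, y_shards):
        with tf.device(dev):
            with tf.GradientTape() as tape:
                loss = loss_fn(ys, model(xs))
            tower_grads.append(tape.gradient(loss, model.trainable_variables))
    # Average the per-GPU gradients and update the weights outside the GPU
    # scopes, mirroring the CPU-side update described above.
    with tf.device('/CPU:0'):
        avg_grads = [tf.reduce_mean(tf.stack(g), axis=0) for g in zip(*tower_grads)]
        optimizer.apply_gradients(zip(avg_grads, model.trainable_variables))
```

The gradient averaging and weight update happen on the host, which is exactly the step that stops scaling well once you go past a handful of GPUs.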

Alternatively, if you are comfortable with coding you may want to have a look at this paper, section 3.5, where it is explained how 8 GPUs were utilized to serve a 4-layer LSTM network. You can probably do things like this to utilize the DGX-1, but as far as I know Tensorflow doesn't support splitting a layer across multiple GPUs. My conclusion is that it's already very hard to utilize 8 GPUs, and above that, bus speed becomes the bottleneck.


Extension:
I double-checked the bus speed and I was wrong; it should not be a problem. Most of the time consumed in training is the computational effort of backpropagation.

Actually, PCIe speed is limited by the CPU and the mobo chipset. The CPU has a number of PCIe lanes which the mobo allocates. The strongest single CPU at the moment is Broadwell-E with 40 lanes (Skylake is rumored to have 44). The mobo allocates that bandwidth to PCIe peripherals at either x16 or x8. So with a 40-lane CPU you can run 2 cards at x16 (2 * 16 = 32 < 40) or 5 cards at x8 (5 * 8 = 40). It should be mentioned that M.2 also uses PCIe lanes, so for the latter option forget about an M.2 drive. A single-CPU system will not take 8 GPUs; that's why they need dual CPUs in the DGX-1. The next limitation is the mobo chipset. The most powerful at the moment are the X99 and the C6 series. X299 will be announced next week if memory serves, and a C6 replacement will probably follow soon.
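As a back-of-the-envelope check of the lane math above (the 40-lane figure and the x16/x8 link widths are the assumptions from this paragraph, not a general rule):

```python
def max_gpus(cpu_lanes, lanes_per_gpu):
    # How many cards fit if every GPU gets the same link width and
    # nothing else (e.g. an M.2 drive) consumes CPU lanes.
    return cpu_lanes // lanes_per_gpu

print(max_gpus(40, 16))  # 2 cards at x16 (2 * 16 = 32 <= 40)
print(max_gpus(40, 8))   # 5 cards at x8  (5 * 8 = 40)
```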

Since PCIe speed is not the bottleneck in machine learning, the updated answer to your question would be: the more GPUs the better. The practical limit on a single CPU seems to be 5 cards at x8 using PCIe 3.0, but more than that is rarely needed since layers can't be split between GPUs. (And the deepest NN that still makes sense might be considered 5-6 layers.)

Manngo