
Why do large language models (LLMs) need massive distributed training across nodes -- if the models fit in one GPU and a larger batch only decreases the variance of the gradients?

tl;dr: assuming the models don't need sharding across nodes, why do we need (massive) distributed training if the models (e.g. CLIP, Chinchilla, even really large GPTs -- e.g. CLIP fits in a V100 32GB) fit in one GPU and a larger batch only decreases the variance of the gradients (but does not expose more tokens or add parameter updates)? A larger batch doesn't necessarily mean we train on "more data/tokens" -- or at least that doesn't seem to be the case for SGD-like optimizers.


Intuitively, it feels like a larger batch size gives the model more tokens to learn from -- but knowing some optimization theory and what SGD-like algorithms actually do, a larger batch size only decreases the variance of the gradients. So it's not clear to me why massive distributed training is needed at all, unless the model is so large that it has to be sharded across nodes. In addition, even if the batch were huge, we can only do a single gradient update with it.
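To make concrete what I mean by "only decreases the variance", here is a toy sketch (plain NumPy, a made-up least-squares loss, nothing to do with LLMs): the mini-batch gradient is just an average of per-sample gradients, so growing the batch shrinks its spread but does not change what a single update looks like in expectation.

```python
import numpy as np

# Toy illustration: mini-batch gradient of a made-up least-squares loss.
# Its standard deviation falls roughly like 1/sqrt(B); its mean does not move.
rng = np.random.default_rng(0)
w_true = 3.0
x = rng.normal(size=100_000)
y = w_true * x + rng.normal(size=x.size)          # noisy linear data

def minibatch_grad(w, idx):
    # gradient of 0.5 * (w*x - y)^2 averaged over the sampled indices
    return np.mean((w * x[idx] - y[idx]) * x[idx])

w = 0.0
for B in (1, 8, 64, 512):
    grads = [minibatch_grad(w, rng.integers(0, x.size, size=B)) for _ in range(2_000)]
    print(f"batch={B:4d}  mean={np.mean(grads):+.3f}  std={np.std(grads):.3f}")
```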

I feel I must be missing something obvious hence the question given how pervasive massive distributed training is.

In addition, some toy training curves with V100s and T5 show me there is very little benefit, if any, from additional GPUs. [training-curve plot]

In addition, it seems from nanoGPT that small batch sizes are sufficient to train (reference https://github.com/karpathy/nanoGPT, but I did ask Karpathy directly to confirm: https://github.com/karpathy/nanoGPT/issues/58).

I am probably missing something obvious, but I wanted to clear this up in my head since it seems to be foundational to training foundation models.

Related to the previous point, I've also been unsure about the role of the batch size in training LLMs compared to traditional deep learning. In traditional deep learning, when we trained in epochs, the larger the batch size the quicker we could get through an epoch -- so the advice I received (e.g. roughly the advice from Ruslan Salakhutdinov at the Simons Institute deep learning tutorials) was to make the batch size large. Intuitively, the larger the batch size the more data the model sees per iteration. But mathematically this only really reduces the variance of the gradient -- and it isn't immediately obvious that this is what we want (I've done experiments and seen papers where noisy gradients lead to better models). It is also clear that the larger the context size the better (for everything, but for the sake of this conversation, for training) -- whenever possible. But context size is totally different from batch size.

So my question is: how does distributed training, especially at the node level, help at all if batch size isn't really the helping factor (which might be a wrong assumption)? The only role for distributed training I see is when the model is too large to fit in one node -- since I'm arguing there is no point in making the batch size too large (I'd guess 32-64 is fine due to the CLT).
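For what it's worth, my mental model of what node-level data parallelism mechanically does is the following minimal PyTorch DistributedDataParallel sketch (the model and dataset are placeholders I made up; launched with torchrun): each rank gets a different shard of the global batch, gradients are averaged across ranks, and every rank applies the same update -- so N GPUs consume N times as many tokens per optimizer step in roughly the same wall-clock time.

```python
# Minimal data-parallel sketch (toy model/dataset, not a real LLM).
# Launch with: torchrun --nproc_per_node=N this_script.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

data = TensorDataset(torch.randn(4096, 128), torch.randn(4096, 1))
sampler = DistributedSampler(data)            # each rank sees its own shard
loader = DataLoader(data, batch_size=64, sampler=sampler)

model = DDP(torch.nn.Linear(128, 1).cuda(rank), device_ids=[rank])
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for xb, yb in loader:
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(xb.cuda(rank)), yb.cuda(rank))
    loss.backward()                           # DDP all-reduces (averages) gradients here
    opt.step()                                # every rank applies the identical update
```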

What am I missing? Empirical answers are fine! Or any answers are fine!


Charlie Parker

2 Answers


I don’t think the problem lies in the gradient or anything related to it -- the problem here is the hardware limitation of GPU VRAM.

Sure, CLIP can fit in a single GPU, but which GPU are we talking about? I have done some experiments with CLIP, and with CLIP ViT-B/16 on 224x224 images and a batch size above 40, it is very easy to get an out-of-memory error on an RTX 3090.

You can try it on Google Colab. Take ViT-B/16, feed it a random tensor of size 224x224x3, then gradually scale up the batch size to see at which point it exceeds the VRAM of the given GPU.
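Roughly something like this (a sketch only; it assumes the Hugging Face transformers checkpoint openai/clip-vit-base-patch16 and does a full training step, since the backward pass and optimizer state are what blow up memory; the exact threshold depends on your GPU):

```python
import torch
from transformers import CLIPModel

# OOM-probe sketch: grow the batch size until the GPU runs out of memory.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch_size in (8, 16, 32, 40, 48, 64):
    try:
        pixels = torch.randn(batch_size, 3, 224, 224, device="cuda")
        tokens = torch.randint(0, 1000, (batch_size, 77), device="cuda")
        out = model(input_ids=tokens, pixel_values=pixels, return_loss=True)
        out.loss.backward()                   # a training step, not just a forward pass
        opt.step(); opt.zero_grad()
        peak = torch.cuda.max_memory_allocated() / 2**30
        print(f"batch {batch_size}: ok, {peak:.1f} GiB peak")
    except RuntimeError as e:                 # CUDA OOM surfaces as a RuntimeError
        if "out of memory" not in str(e):
            raise
        print(f"batch {batch_size}: out of memory")
        break
```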

So in short, it is more a case of "we cannot fit it in a single GPU" than of training stability.

Minh-Long Luu

The large batch size is mostly to speed things up as you point out. You could in principle train these models with batch size 1 and just do gradient accumulation, but then training would take muuuuch longer.
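i.e. something along these lines (a toy sketch, model and data are placeholders): an effective batch of 64 built from micro-batches of size 1 gives you the same update as one batch of 64, but costs 64 sequential forward/backward passes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy gradient-accumulation sketch (placeholder model/data, not a real LLM).
data = TensorDataset(torch.randn(4096, 128), torch.randn(4096, 1))
loader = DataLoader(data, batch_size=1)       # micro-batches of size 1
model = torch.nn.Linear(128, 1).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
accum_steps = 64                              # effective batch size

for step, (xb, yb) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(model(xb.cuda()), yb.cuda())
    (loss / accum_steps).backward()           # gradients sum across micro-batches
    if (step + 1) % accum_steps == 0:
        opt.step()                            # one optimizer update per 64 samples
        opt.zero_grad()
```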

Aside from tokens/second, empirically it has been found that larger batch sizes lead to faster convergence (https://arxiv.org/abs/2212.14034).

The time and efficiency advantages provided by large batch sizes will often necessitate distributed training, as Minh points out, due to the VRAM limitations of individual GPUs. Good luck even just loading (never mind running inference on) anything larger than 15B parameters on an A100 without optimization tricks.

It also doesn’t help that the transformer’s self-attention mechanism has memory requirements that are quadratic in the input sequence length. When you’re training these models, you tend to pack several examples into a single input sequence so that the maximum length is reached, minimizing padding tokens and maximizing GPU utilization (which would otherwise be wasted time/money). The result is that you’re training with fully maxed-out sequence lengths, and hence maxed-out memory requirements, which again typically requires distributed training.
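Packing looks roughly like this (a simplified sketch; real pipelines handle the tokenizer, EOS/BOS tokens and attention masking more carefully):

```python
# Simplified packing sketch: concatenate tokenized documents and slice the
# stream into full-length blocks, so every training sequence is exactly
# max_len tokens and (almost) no padding is needed.
def pack_examples(tokenized_docs, max_len=2048, eos_id=0):
    stream = []
    for doc in tokenized_docs:              # each doc is a list of token ids
        stream.extend(doc + [eos_id])       # separate documents with an EOS token
    n_blocks = len(stream) // max_len       # drop the trailing partial block
    return [stream[i * max_len:(i + 1) * max_len] for i in range(n_blocks)]

# e.g. pack_examples([[5, 6, 7], [8, 9], [10, 11, 12, 13]], max_len=4)
# -> [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```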

thesofakillers