I have seen a number of ways to train (yes, train, not fine-tune) these models efficiently with batches. I will illustrate these techniques with the following example dataset and context window:
Context window:
-----------------
Data samples:
1. ###
2. ################
3. ####
4. ##############
5. ########
6. #########
Suppose we have a batch size of 2. Our pad token is x.
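To make the comparison concrete, here is a minimal Python sketch of this toy setup. The constant names and token IDs (CONTEXT_LEN, PAD_ID, EOS_ID, and the dummy content token 2) are made up purely for illustration:

# Toy version of the example above; all IDs are arbitrary.
CONTEXT_LEN = 17   # width of the context window drawn as dashes
PAD_ID = 0         # "x" in the diagrams
EOS_ID = 1         # "E" in the diagrams

# Six samples of lengths 3, 16, 4, 14, 8 and 9; token ID 2 stands in for real tokens ("#").
sample_lengths = [3, 16, 4, 14, 8, 9]
samples = [[2] * n for n in sample_lengths]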
First technique: Vanilla Padding
Context window:
-----------------
batch 1:
1. ###xxxxxxxxxxxxx
2. ################
batch 2:
3. ####xxxxxxxxxx
4. ##############
batch 3:
5. ########x
6. #########
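A minimal sketch of this scheme, using the toy setup above and padding each batch to its longest sample, as in the diagrams (pad_batch is a made-up helper name, not any particular library's API):

def pad_batch(batch, pad_id):
    # Pad every sample in the batch to the length of the longest sample.
    max_len = max(len(s) for s in batch)
    padded = [s + [pad_id] * (max_len - len(s)) for s in batch]
    # Attention mask: 1 for real tokens, 0 for padding.
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in batch]
    return padded, mask

# Samples are batched in their original order, two at a time.
for start in range(0, len(samples), 2):
    padded, mask = pad_batch(samples[start:start + 2], PAD_ID)
    print([len(row) for row in padded])  # -> [16, 16], then [14, 14], then [9, 9]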
Second technique: Bucketed Padding
Samples of similar length are batched together to minimise the number of pad tokens.
Context window:
-----------------
batch 1:
1. ###x
3. ####
batch 2:
2. ################
4. ##############xx
batch 3:
5. ########x
6. #########
This is the uniform length batching described in this blogpost, also referred to as bucketed random sampling in this paper.
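A sketch of the same idea, assuming the simplest bucketing strategy of sorting all samples by length before forming batches (real implementations typically add shuffling within length buckets). It reuses the toy setup above; bucketed_batches is again just an illustrative name:

def bucketed_batches(samples, batch_size, pad_id):
    # Sort sample indices by length, then batch neighbours together.
    order = sorted(range(len(samples)), key=lambda i: len(samples[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        max_len = max(len(samples[i]) for i in idx)
        padded = [samples[i] + [pad_id] * (max_len - len(samples[i])) for i in idx]
        batches.append((idx, padded))
    return batches

for idx, padded in bucketed_batches(samples, 2, PAD_ID):
    print([i + 1 for i in idx], [len(row) for row in padded])
# -> [1, 3] padded to 4, [5, 6] padded to 9, [4, 2] padded to 16
#    (the same groupings as in the diagram, just in a different batch order)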
Third technique: Concatenating samples
In this technique, we concatenate samples, separating them with an EOS token (E), until the concatenation reaches the context length. In this way, there are no padding tokens and the entire context window is used. The attention mask keeps track of where the EOS tokens occur.
Context window:
-----------------
sequence 1:
###E############# (1 and part of 2)
sequence 2:
###E####E######## (rest of 2, 3, and part of 4)
sequence 3:
######E########E# (rest of 4, 5, and part of 6)
sequence 4:
######## (rest of 6)
This technique is referenced at 2:28 of this video from this huggingface tutorial.
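A sketch of this packing scheme under the assumptions shown in the diagrams: the samples are joined into one stream with a single EOS token between consecutive samples, the stream is chopped into context-window-sized chunks, and only the final chunk is padded. It reuses the toy setup above; pack_samples is an illustrative name, not the Hugging Face implementation:

def pack_samples(samples, context_len, eos_id, pad_id):
    # Concatenate all samples into one stream, separated by EOS tokens.
    stream = []
    for i, s in enumerate(samples):
        if i > 0:
            stream.append(eos_id)
        stream.extend(s)
    # Chop the stream into fixed-length chunks (packed sequences).
    chunks = [stream[i:i + context_len] for i in range(0, len(stream), context_len)]
    # Only the final chunk can fall short of the context window; pad it if necessary.
    chunks[-1] += [pad_id] * (context_len - len(chunks[-1]))
    return chunks

chunks = pack_samples(samples, CONTEXT_LEN, EOS_ID, PAD_ID)
print(len(chunks))                        # -> 4 packed sequences
print([c.count(PAD_ID) for c in chunks])  # -> [0, 0, 0, 9]: padding only in the last one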
With this technique, we reduce the number of batches: the four packed sequences above form two batches of size 2, versus three batches with the padding techniques, and only the final sequence has to be padded if it falls short of the context window. However, it is unclear to me whether this is "allowed" for causal language modelling: will the causal attention mechanism attend to tokens from earlier samples in the same packed sequence, ignoring only the EOS token itself rather than everything before it?
Of these three techniques, which is the most memory-efficient, and which is the most commonly used?