I have seen a number of ways to train (yes, train, not fine-tune) these models efficiently with batches. I will illustrate these techniques with the following example dataset and context window:
Context window:
-----------------
Data samples:
1. ###
2. ################
3. ####
4. ##############
5. ########
6. #########
Suppose we have a batch size of 2. Our pad token is x.
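To make the comparison concrete, here is a minimal Python sketch of this toy setup. The constant names and token IDs (CONTEXT_LEN, PAD_ID, EOS_ID, and the dummy content token 2) are made up purely for illustration:

# Toy version of the example above; all IDs are arbitrary.
CONTEXT_LEN = 17   # width of the context window drawn as dashes
PAD_ID = 0         # "x" in the diagrams
EOS_ID = 1         # "E" in the diagrams

# Six samples of lengths 3, 16, 4, 14, 8 and 9; token ID 2 stands in for real tokens ("#").
sample_lengths = [3, 16, 4, 14, 8, 9]
samples = [[2] * n for n in sample_lengths]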
First technique: Vanilla Padding
Context window:
-----------------
batch 1:
1. ###xxxxxxxxxxxxx
2. ################
batch 2:
3. ####xxxxxxxxxx
4. ##############
batch 3:
5. ########x
6. #########
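A minimal sketch of this scheme, using the toy setup above and padding each batch to its longest sample, as in the diagrams (pad_batch is a made-up helper name, not any particular library's API):

def pad_batch(batch, pad_id):
    # Pad every sample in the batch to the length of the longest sample.
    max_len = max(len(s) for s in batch)
    padded = [s + [pad_id] * (max_len - len(s)) for s in batch]
    # Attention mask: 1 for real tokens, 0 for padding.
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in batch]
    return padded, mask

# Samples are batched in their original order, two at a time.
for start in range(0, len(samples), 2):
    padded, mask = pad_batch(samples[start:start + 2], PAD_ID)
    print([len(row) for row in padded])  # -> [16, 16], then [14, 14], then [9, 9]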
Second technique: Bucketed Padding
Samples of similar length are batched together to minimise the number of pad tokens.
Context window:
-----------------
batch 1:
1. ###x
3. ####
batch 2:
2. ################
4. ##############xx
batch 3:
5. ########x
6. #########
This is the uniform length batching described in this blogpost, also referred to as bucketed random sampling in this paper.
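A sketch of the same idea, assuming the simplest bucketing strategy of sorting all samples by length before forming batches (real implementations typically add shuffling within length buckets). It reuses the toy setup above; bucketed_batches is again just an illustrative name:

def bucketed_batches(samples, batch_size, pad_id):
    # Sort sample indices by length, then batch neighbours together.
    order = sorted(range(len(samples)), key=lambda i: len(samples[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        max_len = max(len(samples[i]) for i in idx)
        padded = [samples[i] + [pad_id] * (max_len - len(samples[i])) for i in idx]
        batches.append((idx, padded))
    return batches

for idx, padded in bucketed_batches(samples, 2, PAD_ID):
    print([i + 1 for i in idx], [len(row) for row in padded])
# -> [1, 3] padded to 4, [5, 6] padded to 9, [4, 2] padded to 16
#    (the same groupings as in the diagram, just in a different batch order)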
Third technique: Concatenating samples
In this technique, we concatenate samples, separating them with an EOS token (E), until the concatenation reaches the context length. In this way, there are no padding tokens and the entire context window is used. The attention mask keeps track of where the EOS tokens occur.
Context window:
-----------------
sequence 1:
###E############# (1 and part of 2)
sequence 2:
###E####E######## (rest of 2, 3, and part of 4)
sequence 3:
######E########E# (rest of 4, 5, and part of 6)
sequence 4:
######## (rest of 6)
This technique is referenced at 2:28 of this video from this huggingface tutorial.
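A sketch of this packing scheme under the assumptions shown in the diagrams: the samples are joined into one stream with a single EOS token between consecutive samples, the stream is chopped into context-window-sized chunks, and only the final chunk is padded. It reuses the toy setup above; pack_samples is an illustrative name, not the Hugging Face implementation:

def pack_samples(samples, context_len, eos_id, pad_id):
    # Concatenate all samples into one stream, separated by EOS tokens.
    stream = []
    for i, s in enumerate(samples):
        if i > 0:
            stream.append(eos_id)
        stream.extend(s)
    # Chop the stream into fixed-length chunks (packed sequences).
    chunks = [stream[i:i + context_len] for i in range(0, len(stream), context_len)]
    # Only the final chunk can fall short of the context window; pad it if necessary.
    chunks[-1] += [pad_id] * (context_len - len(chunks[-1]))
    return chunks

chunks = pack_samples(samples, CONTEXT_LEN, EOS_ID, PAD_ID)
print(len(chunks))                        # -> 4 packed sequences
print([c.count(PAD_ID) for c in chunks])  # -> [0, 0, 0, 9]: padding only in the last one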
With this technique, we reduce the number of batches: the four packed sequences above form two batches of size 2, versus three batches with the padding techniques, and only the final sequence has to be padded if it falls short of the context window. However, it is unclear to me whether this is "allowed" for causal language modelling: will the causal attention mechanism attend to tokens from earlier samples in the same packed sequence, ignoring only the EOS token itself rather than everything before it?
Of these three techniques, which is the most memory-efficient, and which is the most commonly used?