If I understand correctly, when training language models, we take a document and then chunk it into sequences of k tokens. So if the document is of length 30 and k=10, then we'll have 21 chunks of 10 tokens each (tokens 1-10, 2-11, and so on).
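
For concreteness, here is a rough sketch of the chunking I mean (the `chunk` helper is just illustrative, not from any particular library):

```python
def chunk(tokens, k):
    """Return every length-k sliding window of `tokens` (stride 1)."""
    return [tokens[i:i + k] for i in range(len(tokens) - k + 1)]

doc = list(range(1, 31))   # a toy "document" of 30 token ids
chunks = chunk(doc, k=10)
print(len(chunks))         # 21 windows: tokens 1-10, 2-11, ..., 21-30
```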

However, these training sequences are not iid, right? If so, are there any papers that try and deal with this?

Opt

1 Answer

Language models explicitly assume that word sequences are not independent and identically distributed (iid). A word-based model that assumed iid within each sequence could only predict word probabilities from some context other than the surrounding words, which would not be very useful for a language model.

The training processes for statistical models often require iid input/output pairs when sampling e.g. minibatches. Neural networks definitely train better when datasets are shuffled, so that there is no correlation between consecutive examples as they are used to update parameters.

How to resolve these two different needs? When training a sequence-based model on many sequences, it is the distribution of sequences that needs to be iid and representative of the overall population. The distribution of items within the sequence is what is being learned, so should not be obscured or removed.

As an analogy, you do not usually want to shuffle the rows of pixels in an image when training image classifiers. The spatial pattern in the image needs to be preserved in the same way that the sequential pattern in a sentence needs to be preserved, because the pattern is part of the data being modelled. With image data, you accept non-iid relationships between the pixels that are next to each other within a single image, then apply shuffling and stratifying algorithms at the level of individual images.
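
As a rough sketch of that distinction, in illustrative Python (not any particular framework's API): shuffle the order of the collected sequences, but leave each sequence's internal token order untouched, just as you would shuffle whole images rather than pixel rows.

```python
import random

# Sequences collected in document order; each inner list is one intact chunk.
sequences = [list(range(i, i + 10)) for i in range(1, 22)]

random.shuffle(sequences)  # randomize the order of whole sequences only
# Tokens inside each sequence keep their original order; only the order in
# which sequences are presented to the training loop has been randomized.
```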

If so, are there any papers that try and deal with this?

There may be some early papers that compare iid with non-iid data when training on sequences fed into RNNs, but shuffling datasets has been a standard part of engineering practice for decades now, and the separate sequences used to train RNNs are no different.

From your comments:

However the sequences that are sampled for training are not iid if they are multiple sequences generated sequentially from the same document, which from what I understand happens often?

You are correct: the raw set of sequences is not iid if they are collected in that fashion. However, the dataset is always shuffled or resampled for training purposes; it is not fed into training routines in that raw state. The shuffling (of selected sequences, which are kept intact internally) happens in between the raw data collection and the training.
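
In practice this is usually nothing more exotic than the standard shuffle option of whatever data loader you use. A minimal sketch with PyTorch's DataLoader, assuming the chunks have already been collected in document order (the toy data here is just a stand-in):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-in for fixed-length token-id sequences collected in document order.
chunks = [[i + j for j in range(10)] for i in range(100)]

dataset = TensorDataset(torch.tensor(chunks))
loader = DataLoader(dataset, batch_size=8, shuffle=True)  # reshuffled each epoch

for (batch,) in loader:
    # Each row of `batch` is still an intact, ordered sequence of 10 token ids;
    # only the order in which sequences appear across batches is randomized.
    pass
```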


There are some simple statistical models that do not require iid data to train on. This occurs, for example, in tabular reinforcement learning, which can learn online from a single continuous sequence of states, actions and rewards. An equivalent language model would be word- or letter-based n-grams.
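
For example, a minimal sketch of an online count-based bigram model that consumes a single continuous token stream, with no iid sampling at all (illustrative code, not a reference implementation):

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))  # counts[prev][next]
totals = defaultdict(int)                       # totals[prev]

def update(prev_token, next_token):
    """Consume one (prev, next) pair from the running stream."""
    counts[prev_token][next_token] += 1
    totals[prev_token] += 1

def prob(prev_token, next_token):
    """Maximum-likelihood estimate of P(next | prev) from the counts so far."""
    if totals[prev_token] == 0:
        return 0.0
    return counts[prev_token][next_token] / totals[prev_token]

stream = "the cat sat on the mat the cat ran".split()
for prev, nxt in zip(stream, stream[1:]):
    update(prev, nxt)

print(prob("the", "cat"))  # 2/3: two of the three continuations of "the" are "cat"
```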

Neil Slater
    I agree that LMs for a single sequence assume non-iid. However the sequences that are sampled for training are *not* iid if they are multiple sequences generated sequentially from the same document, which from what I understand happens often? – Opt Mar 11 '21 at 20:08
  • @Opt: You keep the sequences intact internally, but shuffle the collection of sequences as you would any other dataset that you need to be iid. This is separate to how the data might be *collected*. It is fine to collect them in order from source documents, the shuffling comes later, i.e. in between collecting the data and training with it – Neil Slater Mar 11 '21 at 21:49
  • Shuffling necessarily doesn't give iid - see https://math.stackexchange.com/questions/2965518/does-the-shuffling-of-a-sequence-of-measurements-produce-an-i-i-d-sequence – Opt Mar 12 '21 at 02:34
  • *doesn't necessarily – Opt Mar 12 '21 at 02:42
  • @Opt: The results in that question are edge cases that are not going to be relevant in most machine learning (including statistical language models), even if they are interesting mathematically. – Neil Slater Mar 12 '21 at 07:21
  • That's not true: suppose you split the dataset into train/validation and use the validation set for early stopping etc. Suppose this is a sentiment analysis task. Then if it's not iid, the training data having proportionally more positive sentiment than the overall data will mean the validation data has proportionally less, which leads to a suboptimal stopping time, etc. – Opt Mar 13 '21 at 17:42