Language models explicitly assume that word sequences are not independent and identically distributed (iid). A word-based model that treated the words within each sequence as iid could only predict word probabilities from something other than the surrounding words, which would not be very useful for a language model.
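As a toy illustration of that point (the tokens below are invented for the example, not from any real corpus), an iid assumption within a sequence reduces the model to unigram frequencies, so the predicted probability of a word cannot depend on its context:

```python
from collections import Counter

# Hypothetical toy corpus of tokens, for illustration only.
tokens = "the cat sat on the mat and the dog sat too".split()

unigram = Counter(tokens)
total = sum(unigram.values())

def p_iid(word, context=None):
    # Under an iid assumption the surrounding words carry no information,
    # so the context argument cannot affect the prediction.
    return unigram[word] / total

print(p_iid("sat", context=["the", "cat"]))  # identical output for any context
print(p_iid("sat", context=["the", "dog"]))
```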
The training processes for statistical models often assume iid input/output pairs when sampling, e.g. when drawing minibatches. Neural networks generally train better when the dataset is shuffled, so that correlations between consecutive examples do not bias the parameter updates.
How can these two needs be reconciled? When training a sequence-based model on many sequences, it is the distribution of the sequences themselves that needs to be iid and representative of the overall population. The distribution of items within each sequence is what is being learned, so it should not be obscured or removed.
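As a rough sketch of what that looks like in practice (the corpus and function name here are made up for the example, not taken from any particular library), shuffling is applied to whole sequences while the order inside each sequence is preserved:

```python
import random

# Hypothetical corpus: each training example is one intact sequence of tokens.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "chase", "cats"],
    ["language", "models", "predict", "the", "next", "word"],
]

def sequences_for_epoch(sequences, seed=None):
    # Shuffle at the sequence level only: the order in which sequences are
    # visited is randomised, but the word order inside each sequence is left
    # untouched, because that internal order is what the model must learn.
    epoch = list(sequences)                # copy so the corpus is not mutated
    random.Random(seed).shuffle(epoch)
    return epoch

for seq in sequences_for_epoch(corpus, seed=0):
    print(seq)
```

Minibatches drawn from the shuffled list are then close to iid samples of sequences, even though the tokens within each sequence remain strongly dependent.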
As an analogy, you do not usually want to shuffle the rows of pixels in an image when training image classifiers. The spatial pattern in the image needs to be preserved in the same way that the sequential pattern in a sentence needs to be preserved, because the pattern is part of the data being modelled. With image data, you accept non-iid relationships between the pixels that are next to each other within a single image, then apply shuffling and stratifying algorithms at the level of individual images.
If so, are there any papers that try and deal with this?
There may be some early papers from the first decades of RNN research which compare iid with non-iid data when training on sequences, but shuffling datasets has been a standard part of engineering practice for decades now, and the separate sequences used to train RNNs are treated no differently.
From your comments:
However the sequences that are sampled for training are not iid if they are multiple sequences generated sequentially from the same document, which from what I understand happens often?
You are correct: the raw set of sequences is not iid if it is collected in that fashion. However, the dataset is shuffled or resampled for training purposes; it is not fed into the training routines in that raw state. The shuffling (of selected sequences, each kept intact internally) happens in between the raw data collection and the training.
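To make that pipeline concrete, here is a minimal sketch (the document text and sequence length are invented for the example) of collecting consecutive windows from a single document and only then shuffling them for training:

```python
import random

# Hypothetical single long document, tokenised into words.
document = ("the quick brown fox jumps over the lazy dog " * 20).split()
SEQ_LEN = 10

# Raw collection: consecutive, non-overlapping windows taken in document
# order. Neighbouring windows come from the same part of the text, so this
# raw list is not iid.
raw_sequences = [document[i:i + SEQ_LEN]
                 for i in range(0, len(document) - SEQ_LEN + 1, SEQ_LEN)]

# Shuffle the list of whole sequences before training; each window stays
# intact internally, only the order in which they are presented changes.
random.seed(1)
training_order = random.sample(raw_sequences, k=len(raw_sequences))
```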
There are some simple statistical models that do not require iid training data. Tabular reinforcement learning, for example, can learn online from a single continuous sequence of states, actions and rewards. The rough equivalent for language modelling would be word- or letter-based n-grams.
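As a small illustration of that kind of model (a hand-rolled bigram counter over an invented token stream, not any standard library API), the counts can be updated online from a single unshuffled stream, much as a tabular RL method updates its table from one ongoing trajectory:

```python
from collections import defaultdict

# Tabular bigram model: a table of counts, updated online from one
# continuous stream of tokens with no shuffling at all.
counts = defaultdict(lambda: defaultdict(int))

def update(prev_word, word):
    counts[prev_word][word] += 1      # one online update per observed pair

def prob(word, prev_word):
    total = sum(counts[prev_word].values())
    return counts[prev_word][word] / total if total else 0.0

stream = "the cat sat on the mat the cat ran".split()  # invented stream
for prev, cur in zip(stream, stream[1:]):
    update(prev, cur)

print(prob("cat", prev_word="the"))  # 2/3: "the" precedes "cat" twice, "mat" once
```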