In Andrew Ng's Deep Learning Specialization (Sequence Models course, around minute 4:13 of the video on negative sampling), he says that with negative sampling we train on only a small sample of words from the vocabulary rather than on every word. He also says that smaller datasets need a larger number of samples, for example 5-20, while larger datasets need fewer, for example 2-5. By samples I mean the words chosen along with the target word to train the model, i.e. the number of negative examples k drawn per positive (context, target) pair.
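To make sure I'm using the term correctly, here is a minimal sketch of what I mean by taking k samples along with the target word. This is my own Python illustration, not code from the course; the vocabulary, counts, and the 3/4-power unigram distribution (which I believe is what word2vec uses) are just for illustration:

```python
import numpy as np

# Toy vocabulary and a single (context, target) training pair (made up).
vocab = ["orange", "juice", "king", "book", "the", "of", "a", "glass"]
context, target = "orange", "juice"

k = 5  # negative samples per positive pair (5-20 for small corpora, 2-5 for large)

rng = np.random.default_rng(0)

# Word2Vec-style sampling distribution: unigram counts raised to the 3/4 power.
# The counts below are invented for this example.
counts = np.array([10, 8, 3, 4, 50, 40, 45, 2], dtype=float)
probs = counts ** 0.75
probs /= probs.sum()

# Draw k negative words (excluding the true target) for this one positive pair.
negatives = []
while len(negatives) < k:
    w = rng.choice(vocab, p=probs)
    if w != target:
        negatives.append(w)

# Each training example is (context, word, label): 1 for the true target,
# 0 for each of the k sampled negatives. Only these k+1 output units are updated.
examples = [(context, target, 1)] + [(context, w, 0) for w in negatives]
print(examples)
```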
Why do small datasets require more negative samples, while big datasets require fewer?