To learn word embeddings, we train a model against some objective function. The model can be an RNN, and the objective can be the likelihood of the corpus: we learn the embeddings by adjusting them to maximize this likelihood, and a set of embeddings is considered good if it assigns the corpus a high likelihood.
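To make this concrete, here is a minimal sketch of that setup (my own illustration, not from the book), assuming PyTorch; the vocabulary size, dimensions, and toy corpus are all made-up values:

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HID_DIM = 100, 16, 32

emb = nn.Embedding(VOCAB_SIZE, EMB_DIM)   # the embedding matrix U (stored V x K)
rnn = nn.RNN(EMB_DIM, HID_DIM, batch_first=True)
out = nn.Linear(HID_DIM, VOCAB_SIZE)      # hidden state -> next-token logits
opt = torch.optim.Adam(
    list(emb.parameters()) + list(rnn.parameters()) + list(out.parameters()),
    lr=1e-2)

tokens = torch.randint(0, VOCAB_SIZE, (1, 50))  # toy "corpus" of 50 token ids

for step in range(100):
    x, y = tokens[:, :-1], tokens[:, 1:]  # predict each next token
    h, _ = rnn(emb(x))                    # backprop reaches emb through the RNN
    logits = out(h)
    # cross-entropy = negative log-likelihood of the corpus under the model
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), y.reshape(-1))
    opt.zero_grad()
    loss.backward()                       # gradients flow into the embeddings
    opt.step()

# After training, emb.weight holds the learned embeddings.
```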
The following paragraph says that it is difficult to scale an RNN to maximum-likelihood estimation on a large corpus:
Likelihood-based optimization is derived from the objective $\log p(w; U)$, where $U \in \mathbb{R}^{K \times V}$ is the matrix of word embeddings, and $w = \{w_m\}_{m=1}^M$ is a corpus, represented as a list of $M$ tokens. Recurrent neural network language models optimize this objective directly, backpropagating to the input word embeddings through the recurrent structure. However, state-of-the-art word embeddings employ huge corpora with hundreds of billions of tokens, and recurrent architectures are difficult to scale to such data. As a result, likelihood-based word embeddings are usually based on simplified likelihoods or heuristic approximations.
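For context (this factorization is standard for language models and is not part of the quote), the objective decomposes by the chain rule as

$$\log p(w; U) = \sum_{m=1}^{M} \log p(w_m \mid w_1, \ldots, w_{m-1}; U),$$

and an RNN language model computes each conditional probability from a hidden state that is updated token by token over the corpus.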
What type of scaling, with respect to the RNN, is being referred to here? Why is it difficult to scale an RNN?
The paragraph above is taken from page 329 of Chapter 14, Distributional and distributed semantics, of the textbook Natural Language Processing by Jacob Eisenstein.