I'm reading the notes here and have a question about page 2 ("Least squares objective" section). The probability of a word $j$ occurring in the context of word $i$ is $$Q_{ij}=\frac{\exp(u_j^Tv_i)}{\sum_{w=1}^W\exp(u_w^Tv_i)}$$
The notes read:
Training proceeds in an on-line, stochastic fashion, but the implied global cross-entropy loss can be calculated as $$J=-\sum_{i\in corpus}\sum_{j\in context(i)}\log Q_{ij}$$ As the same words $i$ and $j$ can occur multiple times in the corpus, it is more efficient to first group together the same values for $i$ and $j$: $$J=-\sum_{i=1}^W\sum_{j=1}^WX_{ij}\log(Q_{ij})$$
where $X_{ij}$ is the total number of times $j$ occurs in the context of $i$, and these co-occurrence counts are collected in the co-occurrence matrix $X$. This much is clear. But then the author states that the denominator of $Q_{ij}$ is too expensive to compute, so the cross-entropy loss won't work.
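Just to make sure I follow the grouped loss, here is a small NumPy sketch of how I picture it being evaluated (the toy vocabulary size, the random vectors `U`, `V` and the counts `X` are placeholders of my own, not anything from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
W, d = 1000, 50                      # toy vocabulary size and embedding dimension
U = rng.normal(size=(W, d))          # context ("output") vectors u_j
V = rng.normal(size=(W, d))          # center ("input") vectors v_i
X = rng.poisson(0.01, size=(W, W))   # toy co-occurrence counts X_ij

scores = V @ U.T                     # scores[i, j] = u_j . v_i
# Softmax over j for each i: the log-denominator sums over all W words,
# which is the per-term cost the notes call too expensive.
logQ = scores - np.logaddexp.reduce(scores, axis=1, keepdims=True)
J = -np.sum(X * logQ)                # grouped cross-entropy: -sum_ij X_ij log Q_ij
```

If this reading is right, every term already needs a sum over the whole vocabulary, so I can at least see where the cost argument comes from.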
Instead, we use a least-squares objective in which the normalization factors in $P$ and $Q$ are discarded: $$\hat J=\sum_{i=1}^W\sum_{j=1}^WX_i(\hat P_{ij}-\hat Q_{ij})^2$$ where $\hat P_{ij}=X_{ij}$ and $\hat Q_{ij}=\exp(u_j^Tv_i)$ are the unnormalized distributions.
$X_i=\sum_kX_{ik}$ is the number of times any word appears in the context of $i$. I don't understand this part. Why have we introduced $X_i$ out of nowhere? How is $\hat P_{ij}$ "unnormalized"? Is there a tradeoff in switching from softmax to MSE?
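Concretely, this is how I am reading the new objective, with the same toy placeholders as in the sketch above (again just my own sketch, not the notes' code):

```python
# Unnormalized "distributions" as I read them: P_hat_ij = X_ij, Q_hat_ij = exp(u_j . v_i)
P_hat = X
Q_hat = np.exp(V @ U.T)
X_i = X.sum(axis=1, keepdims=True)          # X_i = sum_k X_ik, total context count of word i
J_hat = np.sum(X_i * (P_hat - Q_hat) ** 2)  # hat J = sum_ij X_i (P_hat_ij - Q_hat_ij)^2
```

Is that the computation the notes have in mind?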
(As far as I know, softmax made total sense in skip-gram because we were computing scores over different words (discrete possibilities) and matching the predicted output to the actual word, which is essentially a classification problem.)