
I'm reading the notes here and have a doubt on page 2 ("Least squares objective" section). The probability of a word $j$ occurring in the context of word $i$ is $$Q_{ij}=\frac{\exp(u_j^Tv_i)}{\sum_{w=1}^W\exp(u_w^Tv_i)}$$
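For concreteness, here is a minimal numpy sketch of this softmax as I understand it (the toy vocabulary size, embedding dimension, and random vectors are my own placeholders, not anything from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
W, d = 5, 3                      # toy vocabulary size, embedding dimension
U = rng.normal(size=(W, d))      # context ("output") vectors u_w
V = rng.normal(size=(W, d))      # center ("input") vectors v_w

def Q(i, j):
    """Probability of word j occurring in the context of word i."""
    scores = U @ V[i]            # u_w^T v_i for every word w
    scores -= scores.max()       # stabilize the exponentials
    exps = np.exp(scores)
    return exps[j] / exps.sum()  # the sum over all W words is the normalization

print(Q(0, 1))
```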

The notes read:

Training proceeds in an on-line, stochastic fashion, but the implied global cross-entropy loss can be calculated as $$J=-\sum_{i\in corpus}\sum_{j\in context(i)}\log Q_{ij}$$ As the same words $i$ and $j$ can occur multiple times in the corpus, it is more efficient to first group together the same values for $i$ and $j$: $$J=-\sum_{i=1}^W\sum_{j=1}^WX_{ij}\log(Q_{ij})$$

where $X_{ij}$ is the total number of times $j$ occurs in the context of $i$; these co-occurrence frequencies are collected in the co-occurrence matrix $X$. This much is clear. But then the author states that the denominator of $Q_{ij}$, i.e. the normalization sum over the entire vocabulary, is too expensive to compute, so the cross-entropy loss won't work.
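The regrouping step, at least, is easy to check numerically. Here is a sketch reusing `Q` and `W` from the snippet above; the toy corpus and window size are again my own placeholders:

```python
import numpy as np

# Check that the grouped loss equals the per-token loss.
corpus = [0, 1, 2, 1, 0, 2, 1]
window = 1

# Enumerate every (center word i, context word j) pair in the corpus
pairs = []
for t, i in enumerate(corpus):
    for s in range(max(0, t - window), min(len(corpus), t + window + 1)):
        if s != t:
            pairs.append((i, corpus[s]))

# Per-token cross-entropy: one -log Q_ij term per occurrence
J_token = -sum(np.log(Q(i, j)) for i, j in pairs)

# Grouped form: accumulate co-occurrence counts X_ij, then weight log Q_ij
X = np.zeros((W, W))
for i, j in pairs:
    X[i, j] += 1
J_grouped = -sum(X[i, j] * np.log(Q(i, j))
                 for i in range(W) for j in range(W) if X[i, j] > 0)

assert np.isclose(J_token, J_grouped)   # identical sums, just regrouped
```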

Instead, we use a least-squares objective in which the normalization factors in $P$ and $Q$ are discarded: $$\hat J=\sum_{i=1}^W\sum_{j=1}^WX_i(\hat P_{ij}-\hat Q_{ij})^2$$ where $\hat P_{ij}=X_{ij}$ and $\hat Q_{ij}=\exp(u_j^Tv_i)$ are the unnormalized distributions.

$X_i=\sum_kX_{ik}$ is the number of times any word appears in the context of $i$. I don't understand this part. Why have we introduced $X_i$ out of nowhere? How is $\hat P_{ij}$ "unnormalized"? Is there a tradeoff in switching from softmax to MSE?
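To make sure I'm at least parsing the formula correctly, here is how I read $\hat J$ in code (a sketch reusing the toy `U`, `V`, `X`, `W` from the snippets above; all names are mine):

```python
import numpy as np

X_i = X.sum(axis=1)                  # X_i = sum_k X_ik, one weight per row

J_hat = 0.0
for i in range(W):
    for j in range(W):
        P_hat = X[i, j]              # unnormalized: X_ij, not X_ij / X_i
        Q_hat = np.exp(U[j] @ V[i])  # unnormalized: no softmax denominator
        J_hat += X_i[i] * (P_hat - Q_hat) ** 2

print(J_hat)
```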

(As far as I know, softmax made total sense in skip-gram because we were computing scores for the different words (discrete possibilities) and matching the predicted output to the actual word, much like in a classification problem.)

Shirish Kulhari
  • In the notes, the author specifically says "it requires the distribution Q to be properly normalized, which involves the expensive summation over the entire vocabulary", so I think you should also have said this, because it was not clear what the _denominator_ is in your context. So, the denominator is the normalization factor. – nbro Sep 17 '19 at 17:20
  • I think I have the answer to your question: _How is $\hat P_{ij}$ "unnormalized"?_. In section 1.2 of the notes, the author defines $P_{ij}$ as the fraction between $X_{ij}$ and $X_i$, so I suppose that $\hat P_{ij}$ is just $X_{ij}$ (only the numerator), which is _the number of times word $j$ occurs in the context of word $i$_. – nbro Sep 17 '19 at 17:25
  • @nbro: Ah you're absolutely right. Careless of me not to see that. But I'm still not sure why $X_i$ comes in – Shirish Kulhari Sep 17 '19 at 17:36

0 Answers