
I'm trying to understand a few details about the NT-Xent loss defined in the SimCLR paper (link). The loss is defined as

$$\ell_{i,j} = -\log\frac{\exp(\mathrm{sim}(z_i,z_j)/\tau)}{\sum_{k=1}^{2N}\mathbb{1}_{[k\neq i]} \exp(\mathrm{sim}(z_i,z_k)/\tau)}$$

where $z_i$ and $z_j$ are two augmented views of the same image. What I don't understand is the denominator: I see that the indicator function excludes the point $z_i$, but shouldn't we also exclude $z_j$? Otherwise we will have $k=j$ for some $k$. Essentially, why do we let the positive sample appear in the denominator?
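For concreteness, here is how I read the formula in code (a minimal NumPy sketch; the function name `nt_xent_ij` and the use of cosine similarity for $\mathrm{sim}$ are my own assumptions, following the paper):

```python
import numpy as np

def cosine_sim(a, b):
    # sim(u, v) = u . v / (||u|| ||v||)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nt_xent_ij(z, i, j, tau=0.5):
    """l_{i,j} exactly as written: the denominator sums over all
    k != i, which notably still includes the positive k = j."""
    num = np.exp(cosine_sim(z[i], z[j]) / tau)
    den = sum(np.exp(cosine_sim(z[i], z[k]) / tau)
              for k in range(len(z)) if k != i)
    return -np.log(num / den)
```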

James Arten

1 Answer


To train the model, you take $N$ samples and build $N$ positive pairs by applying two different augmentations to each sample, so the total number of augmented samples is $2N$. Now, for a given pair $(i,j)$, that pair is the positive one, and the remaining $2(N-1)=2N-2$ samples act as negatives.

For each positive pair $(i,j)$, the numerator is the exponentiated, temperature-scaled similarity of the positive pair, and the denominator sums that quantity over the anchor $i$ and every other sample $k \neq i$. Note that the indicator excludes only $k = i$ (the anchor's similarity with itself, which would trivially dominate), so the positive $k = j$ does appear in the denominator. This is exactly a softmax cross-entropy over the $2N-1$ remaining candidates: including the positive in the normalization makes the ratio a proper probability (the probabilities over all $k \neq i$ sum to 1) and bounds the loss below by 0. The loss is minimized when $\mathrm{sim}(z_i, z_j)$ dominates the similarities to all negatives $(i,k)$, where $k$ is a view from a different input (say $x'$, with $x \neq x'$), while $i$ and $j$ are positive because they are two different views of the same input $x$. I found a blog post that explains this too.

Lastly, the loss is also computed on the swapped pairs $(j,i)$; the final loss sums $\ell_{i,j}$ over all positive pairs in both orders and divides by $2N$, i.e. 2 times the batch size $N$.
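Putting it together, a batch-level sketch might look like this (my own illustrative code, not the official implementation; `nt_xent_batch` assumes views $2m$ and $2m+1$ come from the same input sample $m$):

```python
import numpy as np

def nt_xent_batch(z, tau=0.5):
    """Full NT-Xent over 2N views: averages l_{i,j} and l_{j,i}
    over all N positive pairs, i.e. divides by 2N."""
    two_n = len(z)
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = zn @ zn.T / tau               # pairwise cosine sims / temperature
    np.fill_diagonal(sim, -np.inf)      # exclude k = i (exp(-inf) = 0)
    # log-softmax per row: log of the normalized probability of each k != i
    logits = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    total = 0.0
    for m in range(two_n // 2):
        i, j = 2 * m, 2 * m + 1
        total += -logits[i, j] - logits[j, i]   # both orders of the pair
    return total / two_n
```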

Luca Anzalone
  • Mmmh, ok, but to have normalized probabilities shouldn't I also include the index $i$ itself? Thanks for your answer, but could you elaborate a bit more, please? I would like to have this point clear. – James Arten May 30 '23 at 15:56
  • @JamesArten I've edited my answer, I hope it is clearer now – Luca Anzalone May 30 '23 at 17:54