
I am working on a modified version of the triplet loss function introduced with SBERT, where instead of the Euclidean distance we use the cosine similarity. The formula to minimize is

$$\max\left(\frac{s_a \cdot s_p}{\lVert s_a \rVert \, \lVert s_p \rVert} - \frac{s_a \cdot s_n}{\lVert s_a \rVert \, \lVert s_n \rVert} + \epsilon,\ 0\right)$$

where $s_a$ is the embedding of the anchor sentence (the context), $s_p$ is the embedding of the positive sentence (the correct continuation), $s_n$ is the embedding of the negative sentence (a wrong continuation), and $\epsilon$ is the margin.
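For concreteness, here is a minimal sketch of this loss in PyTorch (the function name and the `margin` argument, standing in for $\epsilon$, are my own illustration):

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(s_a, s_p, s_n, margin=1.0):
    # cos(s_a, s_p) = (s_a · s_p) / (||s_a|| ||s_p||), and likewise for s_n
    sim_pos = F.cosine_similarity(s_a, s_p, dim=-1)
    sim_neg = F.cosine_similarity(s_a, s_n, dim=-1)
    # max(cos(s_a, s_p) - cos(s_a, s_n) + margin, 0), averaged over the batch
    return torch.clamp(sim_pos - sim_neg + margin, min=0.0).mean()
```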

I would like to check that the function I came up with makes sense from a theoretical point of view. Where should I look to find out which properties a loss function should satisfy?

Motivation for the question: I'm getting my hands dirty with contrastive loss functions, and this is an easy variation I came up with.

albus_c
  • You should share with us the loss function (the formula) that you came up with. Maybe you should also explain how it's different from the original loss and why you came up with this new one. – nbro Apr 04 '22 at 08:52
  • I modified the question as recommended. – albus_c Apr 04 '22 at 09:11
  • Edit your post to include this info directly there. Note that you can use MathJax on this site. – nbro Apr 04 '22 at 09:12
  • I'm getting my hands dirty with contrastive loss functions, and this is an easy variation I came up with. – albus_c Apr 04 '22 at 09:12
  • I modified the title to be more specific and to be the question that I think you're asking. Make sure that's the case. Again, I would highly recommend that you use MathJax to format the loss function. – nbro Apr 04 '22 at 09:22

1 Answer


A loss function is just a function with a minimum.

In machine learning, though, we also require the loss function to be differentiable, otherwise there is no backpropagation and hence no weight updates. Moreover, basically every deep learning library relies on automatic differentiation (autograd), so if the function is not differentiable your code will simply crash.
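For instance, a quick autograd check (a sketch in PyTorch, using the cosine formulation from the question) shows that gradients do flow through this loss:

```python
import torch
import torch.nn.functional as F

# Dummy embedding batches; only the anchor tracks gradients here.
s_a = torch.randn(4, 128, requires_grad=True)
s_p = torch.randn(4, 128)
s_n = torch.randn(4, 128)

loss = torch.clamp(
    F.cosine_similarity(s_a, s_p, dim=-1)
    - F.cosine_similarity(s_a, s_n, dim=-1)
    + 1.0,  # margin
    min=0.0,
).mean()

loss.backward()
print(s_a.grad.shape)  # torch.Size([4, 128]): autograd produced a gradient
```

(Strictly speaking, $\max(\cdot, 0)$ is only subdifferentiable at the kink, but autograd handles this the same way it handles ReLU.)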

A stronger, but not compulsory, condition is Lipschitz continuity, i.e. ensuring that the function does not change faster than some constant rate. Intuitively, a loss function should output high values for big differences between predictions and targets and small values for small differences; otherwise, the weight updates risk being too big (no convergence) or too small (easily getting stuck in local minima).

Regarding your loss, the only issue I see is that you're replacing a proper metric, i.e. the Euclidean distance, with a function that is not a metric, i.e. the cosine similarity (which does not satisfy the triangle inequality, hence it's not a metric). So I would be careful and test what kind of values you get with some dummy data, to understand whether it still behaves as a proper loss.
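Following that advice, here is a small dummy-data check (again a sketch in PyTorch; the helper mirrors the formula from the question). Since each cosine similarity lies in [-1, 1], the per-example loss is bounded in [0, 2 + margin], unlike the unbounded Euclidean version:

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(s_a, s_p, s_n, margin=1.0):
    sim_pos = F.cosine_similarity(s_a, s_p, dim=-1)
    sim_neg = F.cosine_similarity(s_a, s_n, dim=-1)
    return torch.clamp(sim_pos - sim_neg + margin, min=0.0)

a = torch.tensor([[1.0, 0.0]])
# positive identical, negative opposite: 1 - (-1) + 1 = 3.0
print(cosine_triplet_loss(a, a, -a))
# positive opposite, negative identical: -1 - 1 + 1 = -1, clamped to 0.0
print(cosine_triplet_loss(a, -a, a))
```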

Edoardo Guerriero