I am looking to compute the similarity between a large set of vectors during neural network training, a process that becomes considerably expensive with the wrong metric. So far I am using cosine similarity, but I found that it slows training down by several orders of magnitude. In each batch, I compute batch_size^2 similarities between dense 9000-dimensional floating-point vectors. Is there an alternative metric to cosine similarity that is much faster but still differentiable?
1 Answer
You are probably not going to get much faster than cosine similarity. The main reason your computation is slow is that you evaluate the metric batch_size$^2$ times, which gives a runtime complexity of $O(N^2)$ no matter which metric you choose. However, to test how much of an improvement a simpler metric would give you, you could try the $\infty$-norm, $d(u, v) = \max_i | u_i - v_i |$, or the plain dot product (i.e. cosine similarity without normalizing the vectors), and see whether a simpler metric actually provides a significant performance improvement.
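To make the comparison concrete, here is a minimal sketch (PyTorch is assumed here, since the question does not name a framework, and the tensor names are hypothetical) of the three candidates as batched, differentiable operations:

```python
import torch
import torch.nn.functional as F

# Hypothetical batch: batch_size x 9000 dense float vectors.
x = torch.randn(64, 9000, requires_grad=True)

# Cosine similarity: one normalization pass, then a single matrix multiply.
x_normed = F.normalize(x, dim=1)
cos_sim = x_normed @ x_normed.T          # shape [batch_size, batch_size]

# Plain dot product: the same matrix multiply without the normalization.
dot_sim = x @ x.T

# Infinity-norm (Chebyshev) distance; max is only subdifferentiable,
# but autograd handles it.
linf_dist = torch.cdist(x, x, p=float('inf'))
```

Note that cosine similarity only adds a single normalization pass on top of the dot product, so the gap between the two is usually small compared to the cost of the $O(N^2)$ pair count itself.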
If you then find that the metric is not the bottleneck, you can check out Linformer by Wang et al. As you might know, the Transformer architecture also compares $N$ vectors pairwise; Wang et al. devised a method to reduce the memory and time complexity from $O(N^2)$ to $O(N)$. This fits your problem closely, so it may be a good read for you.

- Sampling is also a good idea if you can afford to miss out on some comparisons; I think the sampling strategy would depend on your use case. Do you perhaps have a prior on the vector similarities that you could use to speed up the process? Can you provide a bit more information on what exactly you want to do? Maybe something else comes to mind. – Chillston May 23 '23 at 18:46
- Firstly, nice summary of what you are planning. As this is only a sort of regularization term, I think you should easily get away with sampling in this case. However, by minimizing the pairwise distance between the latent vectors, all vectors should move closer together and ultimately approach some point $p$, right? – Chillston May 24 '23 at 11:29
- I wonder if you would get similar behavior by just taking the Euclidean norms of the vectors instead of the pairwise distances. If you do that, all vectors should approach the $0$-vector and thus also move closer together, so in the end the effect should be the same(?). Doing that would also be $O(N)$ instead of $O(N^2)$. However, I am not sure I completely understand the idea in general: it's not super intuitive to me why this term should promote disentanglement. Since you worked with VAEs, you probably know about $\beta$-VAE? Apparently it generates more disentangled representations. – Chillston May 24 '23 at 11:29
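For illustration, here is a minimal sketch (again with PyTorch assumed and hypothetical names) contrasting the $O(N^2)$ pairwise regularizer with the $O(N)$ per-vector norm penalty suggested in the last comment:

```python
import torch

# Hypothetical batch of latent vectors.
z = torch.randn(64, 9000, requires_grad=True)

# O(N^2) variant: mean pairwise Euclidean distance between all latent vectors.
pairwise_reg = torch.pdist(z, p=2).mean()

# O(N) variant: penalize each vector's norm directly, which pulls all vectors
# toward the zero vector and therefore also closer to each other.
norm_reg = torch.linalg.vector_norm(z, dim=1).mean()
```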