
We represent catalogue items (documents, music tracks, videos, and so on) as vectors embedded in R^d and use them to retrieve nearest neighbours for a user's query. The typical scenario is that users can input an arbitrary query, while the search results come from a fixed catalogue.

I noticed experimentally that certain results occur much more frequently among the k nearest neighbours than others. The majority of documents are never found among the nearest neighbours of any other document in the catalogue, while a few are found many times. In other words, the distribution of "how many times a document is found" is extremely skewed.

It doesn't matter whether I use Euclidean distance, cosine similarity, or dot product, and it doesn't depend on the catalogue. It does seem to depend on the embedding dimension, however.

I tried to reproduce the problem with random data embedded in R^d for various d, measuring the distribution of "how many times a vector is among the 10 nearest neighbours of some other point". For low dimensions (up to about 10) the distribution is close to normal, centered at 10. With increasing dimension, however, it becomes very skewed. Hubs (points that occur often in k-NN search results) are consistently the points closest to the cluster centre.
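For reference, the experiment I ran is roughly the following (a minimal self-contained sketch with NumPy; n, the dimensions tested, and the skewness measure are my choices):

```python
# Reproduce the hubness effect with random Gaussian data in R^d.
# For each dimension d, draw n points, find each point's 10 nearest
# neighbours (brute force, Euclidean), and count how often each point
# appears in other points' neighbour lists (its "k-occurrence").

import numpy as np

def k_occurrence(X, k=10):
    """Count how many times each row of X appears among the
    k nearest neighbours (Euclidean) of the other rows."""
    # Pairwise squared distances via ||a-b||^2 = ||a||^2 - 2 a.b + ||b||^2
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] - 2.0 * (X @ X.T) + sq[None, :]
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    # Indices of each point's k nearest neighbours
    nn = np.argpartition(d2, k, axis=1)[:, :k]
    return np.bincount(nn.ravel(), minlength=len(X))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 2000, 10
    for d in (3, 10, 100):
        X = rng.standard_normal((n, d))
        occ = k_occurrence(X, k)
        # The mean is always exactly k; the skewness grows with d
        skew = ((occ - occ.mean()) ** 3).mean() / occ.std() ** 3
        print(f"d={d:4d}  mean={occ.mean():.1f}  "
              f"max={occ.max()}  skew={skew:.2f}")
```

The mean k-occurrence is always exactly k (n points each contribute k neighbour slots), so the growing maximum and skewness with d show the effect directly.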

I asked about this on Math Stack Exchange without getting an answer there. The only comment I received was "nearest neighbours do not make sense in high dimension, everything is too far from everything else", which is true but not useful.

I found this paper, which recommends (in Chapter 5) mitigating the problem by penalizing any data point that is a hub (i.e., that occurs often in k-nearest-neighbour searches) with some discount on its similarity score. This may work, but it looks more like a hack than a systematic approach.
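To make the idea concrete, here is a hypothetical version of such a discount (my own illustrative formula, not the paper's exact scheme; the log penalty and the `alpha` weight are assumptions):

```python
# Sketch of hub-discounted re-ranking: estimate each catalogue item's
# k-occurrence offline, then subtract a penalty from its similarity
# score at query time. Purely illustrative, not the paper's formula.

import numpy as np

def hub_penalized_scores(sim, occurrence, k=10, alpha=0.5):
    """sim: similarity of the query to each catalogue item.
    occurrence: precomputed k-occurrence count of each item.
    Items that appear among neighbours far more often than the
    expected k times have their score discounted."""
    excess = np.maximum(occurrence - k, 0)   # how "hubby" the item is
    penalty = alpha * np.log1p(excess)       # grows slowly with excess
    return sim - penalty
```

For example, two items with identical raw similarity but k-occurrences of 10 and 100 would be re-ranked so that the non-hub comes first; `alpha` trades off raw similarity against hub suppression.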

So here is my question:

  • is there some standard method for approaching this? Maybe try to reduce the embedding dimension, assuming the other quality metrics remain good enough?
  • are there other common practices to ensure that customers don't always get the same (even if good) results?

Thanks for any hints or references.

Peter Franek
  • I cannot answer your question why this is happening, hence only a comment. But for high dimensions, I recommend HNSW or Facebook's FAISS. I've used HNSW myself and appreciated how easily you can trade off accuracy (which is pretty high anyways) for runtime by adjusting simple parameters. Their demonstration on github is pretty easy to follow: https://github.com/nmslib/hnswlib You could also consider combining KNN with TF-IDF, such as here: https://stackoverflow.com/questions/59057351/knn-for-text-classification-using-tf-idf-scores – postnubilaphoebus May 23 '23 at 21:12
  • The idea of including TF-IDF being that you get rid of "uninteresting" similarities by counterweighing each term frequency by its inverse document frequency. Finally, perhaps consider using an autoencoder for information retrieval, although that is certainly trickier. – postnubilaphoebus May 23 '23 at 21:20
  • High-dimensional spaces are very counter-intuitive. https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions/99191#99191 – Sycorax May 29 '23 at 16:37

0 Answers