We represent catalogue items (documents, music tracks, videos, etc.) as vectors embedded in R^d and use them to retrieve the nearest neighbours of a user's query. The typical scenario is that users can enter arbitrary queries, while the search results come from a fixed catalogue.
I noticed experimentally that certain results occur much more frequently among the k nearest neighbours than others. The majority of documents are never found among the nearest neighbours of any other document in the catalogue, while a few are found many times. In other words, the distribution of "how many times a document is retrieved" is extremely skewed.
It doesn't matter whether I use Euclidean distance, cosine similarity, or dot product, and it doesn't depend on the catalogue. It does, however, seem to depend on the embedding dimension.
I tried to reproduce the problem with random data embedded in R^d for various d, measuring the distribution of "how many times a vector appears among the 10 nearest neighbours of some other point". For low dimensions (up to 10) the distribution is close to normal, centred at 10. With increasing dimension, however, it becomes very skewed. Hubs (points that occur often in k-NN results) are always the points closest to the cluster centre.
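For reference, here is the toy experiment in a few lines of NumPy (the helper name `k_occurrence` is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def k_occurrence(X, k=10):
    """Count how often each point appears among the k nearest
    neighbours of the other points (its 'k-occurrence')."""
    n = X.shape[0]
    # pairwise squared Euclidean distances via the norm expansion
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(d2, np.inf)          # exclude self-matches
    knn = np.argsort(d2, axis=1)[:, :k]   # k-NN indices of each point
    return np.bincount(knn.ravel(), minlength=n)

n, k = 600, 10
results = {}
for d in (3, 50):
    X = rng.standard_normal((n, d))
    occ = k_occurrence(X, k)
    results[d] = occ
    skew = ((occ - occ.mean()) ** 3).mean() / occ.std() ** 3
    print(f"d={d:3d}  mean={occ.mean():.1f}  max={occ.max()}  skew={skew:.2f}")
```

The mean is k by construction (each point contributes exactly k votes); what changes with d is the spread and skewness of the counts.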
I asked about this on math.stackexchange without any response. The only comment I got was "nearest neighbours do not make sense in high dimension; everything is too far from everything else", which is true but not useful.
I found this paper, which recommends (in Chapter 5) mitigating the problem by penalizing any data point that is a hub (i.e. occurs often in k-nearest-neighbour results) with some discount on its similarity score. This may work, but it looks more like a hack than a principled approach.
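To illustrate the discount idea: one related mitigation I've seen in the hubness-reduction literature is local scaling (NICDM-style), where each raw distance is divided by the mean distance of the query and of the item to their own k nearest neighbours, so items in dense regions (the typical hubs) get their distances inflated. A rough sketch, not the paper's exact scheme, with made-up names:

```python
import numpy as np

def nicdm_rerank(q, X, k=10):
    """Re-rank catalogue X for query q with local-scaling distances:
    divide each raw distance by the geometric mean of the query's and
    the item's average distance to their own k nearest neighbours.
    O(n^2) memory here -- fine for a sketch, use ANN indices at scale."""
    d_qx = np.linalg.norm(X - q, axis=1)            # query-to-item distances
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)                     # ignore self-distances
    mu = np.sort(D, axis=1)[:, :k].mean(axis=1)     # mean k-NN distance per item
    mu_q = np.sort(d_qx)[:k].mean()                 # same statistic for the query
    scaled = d_qx / np.sqrt(mu_q * mu)              # hubs (small mu) are pushed away
    return np.argsort(scaled)                       # best matches first

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))   # toy catalogue
q = rng.standard_normal(32)          # toy query
ranking = nicdm_rerank(q, X)
print(ranking[:10])                  # top-10 under the scaled distance
```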
So here are my questions:
- is there a standard method for approaching this? Maybe minimizing the embedding dimension, provided other quality metrics remain good enough?
- are there other common practices to ensure that customers don't always get the same (even if good) results?
Thanks for any hints or references.