Perform clustering on high dimensional data

Question

Recently I trained a BYOL model on a set of images to learn an embedding space where similar vectors are close by. The performance was fantastic when I performed approximate K-nearest neighbours search.

The next task, and the one I am stuck on, is to find a clustering algorithm that uncovers clusters in the embedding vectors produced by the BYOL-trained feature extractor (each vector has 1024 dimensions and there are 1 million vectors). I have no a priori information about the number of classes, i.e. clusters, in my dataset, so I cannot use K-means directly. Is there a scalable clustering algorithm that can help me uncover such clusters? I tried FISHDBC, but the repository does not have good documentation.

score 0 · Answer 1 · answered Jan 18 '22 at 00:07

You can use K-means even without knowing the number of clusters a priori. Take a look at the silhouette score: it is a generic approach applicable to any clustering method that takes the number of clusters to generate as an input.

For each point, the silhouette compares how similar it is to the other elements of its own cluster with how dissimilar it is to the elements of the nearest neighbouring cluster. The silhouette score is the average of this value over all points and ranges from -1 to 1. The higher the silhouette score, the better the clusters split the data.
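For reference, the standard definition can be written as follows, where the symbols $a(i)$, $b(i)$ and $s(i)$ are notation introduced here rather than taken from the answer: $a(i)$ is the mean distance from point $i$ to the other points in its own cluster, and $b(i)$ is the smallest mean distance from point $i$ to the points of any other cluster.

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
S = \frac{1}{N} \sum_{i=1}^{N} s(i)
$$

Each $s(i)$ lies in $[-1, 1]$, so the overall score $S$ does too. By the usual convention, $s(i) = 0$ for a point that is alone in its cluster.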

The only drawback of the approach is that you have to run the clustering algorithm several times, once for each candidate value of n, the number of clusters. That is the only way to get a plot of silhouette score against n, like the one shown with this answer, in order to read off the optimal value of n (3 in this made-up example).
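A minimal sketch of such a sweep with scikit-learn is shown here. Several things are assumptions rather than part of the answer: `MiniBatchKMeans` is swapped in for plain K-means to keep each run cheap, the silhouette is estimated on a random subsample via `sample_size` (the exact score needs all pairwise distances, which is impractical for 1 million vectors), the helper name `sweep_silhouette` is made up, and synthetic blobs stand in for the real embeddings.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def sweep_silhouette(X, candidate_ks, sample_size=2000, seed=0):
    """Cluster X once per candidate k and return {k: silhouette estimate}."""
    scores = {}
    for k in candidate_ks:
        # Mini-batch K-means: an approximation of K-means that scales better.
        km = MiniBatchKMeans(n_clusters=k, batch_size=1024, n_init=3,
                             random_state=seed)
        labels = km.fit_predict(X)
        # Estimate the silhouette on a random subsample of the points.
        scores[k] = silhouette_score(
            X, labels,
            sample_size=min(sample_size, len(X)),
            random_state=seed,
        )
    return scores


# Synthetic stand-in for the embeddings: three well-separated blobs in 16-D.
centers = np.array([[-10.0] * 16, [0.0] * 16, [10.0] * 16])
X, _ = make_blobs(n_samples=3000, centers=centers, cluster_std=1.0,
                  random_state=0)

scores = sweep_silhouette(X, range(2, 7))
best_k = max(scores, key=scores.get)  # candidate with the highest score
print(scores)
print("best k:", best_k)
```

On real embeddings the range of candidate values would need to be much wider, and the subsample size trades accuracy of the estimate against run time.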
