Which metric should I use to assess the quality of the clusters?

Question

I have a model that outputs a latent N-dimensional embedding for every data point. It is trained so that data points from the same class cluster together and are well separated from the clusters of other classes.

The N-dimensional embedding is projected down to 2D using UMAP. At each epoch, I wish to test the clustering capability of the model on these 2D projections for use as validation accuracy. I have the labels for each class.

How should I proceed?

score 2 · Answer 1 · edited May 13 '22 at 08:26

You can compute the Silhouette Coefficient for this. It ranges from -1 to 1, and since you have the class labels, you can use them directly as the cluster assignment. Its values mean:

1: Clusters are well apart from each other and clearly distinguished.

0: Clusters overlap, i.e. the distance between clusters is not significant.

-1: Points are assigned to the wrong clusters (they lie closer to another cluster than to their own).
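As a rough sketch of how this could look with scikit-learn's `silhouette_score`: the snippet below uses synthetic blobs as a stand-in for the 2D UMAP projections, so `X_2d`, `y`, and the blob centers are assumptions for illustration, not part of the original setup. It also scores a shuffled copy of the labels to show how the value drops when the assignment no longer matches the geometry.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the 2D UMAP projections and their class labels
# (assumption: in practice X_2d comes from UMAP and y from your dataset).
X_2d, y = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette with the true class labels used as the cluster assignment.
score_true = silhouette_score(X_2d, y)

# Same points with shuffled labels: the assignment ignores the geometry.
rng = np.random.default_rng(0)
score_shuffled = silhouette_score(X_2d, rng.permutation(y))

print(f"true labels:     {score_true:.3f}")
print(f"shuffled labels: {score_shuffled:.3f}")
```

Logging `score_true` once per epoch would give a single number to track; higher is better.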

Other measures, such as purity and mutual information, are also possible. Each of these is

an external criterion that evaluates how well the clustering matches the gold standard classes.
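External criteria need a predicted cluster assignment to compare against the labels. A minimal sketch, assuming the clusters are obtained by running k-means on the projected points (that choice, the synthetic data, and the helper name `purity_score` are all illustrative assumptions): adjusted mutual information comes from scikit-learn, and purity is computed from the contingency matrix.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

# Synthetic stand-in for the 2D projections and the gold standard classes.
X_2d, y_true = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

# Assumption: cluster the projections with k-means, one cluster per class.
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)


def purity_score(labels_true, labels_pred):
    """Fraction of points that belong to the majority class of their cluster."""
    cm = contingency_matrix(labels_true, labels_pred)  # rows: classes, cols: clusters
    return cm.max(axis=0).sum() / cm.sum()


ami = adjusted_mutual_info_score(y_true, y_pred)
purity = purity_score(y_true, y_pred)

print(f"adjusted mutual information: {ami:.3f}")
print(f"purity:                      {purity:.3f}")
```

Both scores are invariant to how the cluster IDs are numbered, so the k-means labels do not need to be matched to the class labels first.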

score 1 · Answer 2 · edited May 13 '22 at 08:27

Another popular metric for this is the Davies-Bouldin score. Lower values indicate better-separated clusters, with 0 as the minimum.

You can also take a look at the clustering metrics in the scikit-learn documentation.
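For illustration, here is a small sketch using scikit-learn's `davies_bouldin_score`. The two synthetic datasets (tight and loose blobs around the same centers) are assumptions standing in for 2D projections from a better and a worse epoch.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

centers = [[-5, -5], [0, 5], [5, -5]]

# Stand-in for a "good" epoch: compact, well-separated classes.
X_tight, y_tight = make_blobs(
    n_samples=300, centers=centers, cluster_std=0.5, random_state=0
)

# Stand-in for a "bad" epoch: same class centers, much more spread.
X_loose, y_loose = make_blobs(
    n_samples=300, centers=centers, cluster_std=3.0, random_state=0
)

# The class labels are used as the cluster assignment; lower is better.
db_tight = davies_bouldin_score(X_tight, y_tight)
db_loose = davies_bouldin_score(X_loose, y_loose)

print(f"tight clusters: {db_tight:.3f}")
print(f"loose clusters: {db_loose:.3f}")
```

Because lower is better here, the sign of the trend is the opposite of the silhouette score when plotted over epochs.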

edited May 13 '22 at 08:27

nbro


answered Feb 10 '21 at 23:19

Abhishek Verma

