Background
I'm implementing the DBSCAN algorithm. I have run it on a small dataset of randomly generated clusters, and I want a decimal value for how accurately it recovers the groups.
Motivation
This is for some simple unit testing that checks it can cluster basic, well-separated classes (i.e. checking in my CI/CD pipeline that the accuracy is suitably high).
The Problem
The problem is that its outputs are its own cluster indices (the classes it has discovered/chosen). These do not align with the original integer labels I generated earlier.
An Example to Clear Things Up
Say I have some original labels:
y_true = np.array([0, 1, 1, 0, 2, 1, 2])
and I have some cluster predictions from the model:
y_predicted = np.array([1, 0, 0, 1, 2, 0, 2])
You can see it has clustered the values correctly, but the cluster indices don't align with the original y_true values. Therefore we can't use the usual accuracy calculation np.mean(y_true == y_predicted).
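To make the mismatch concrete: on the arrays above, the naive accuracy comes out at about 0.29 even though every cluster is internally correct, because only the two class-2 samples happen to share the same index:

import numpy as np

y_true = np.array([0, 1, 1, 0, 2, 1, 2])
y_predicted = np.array([1, 0, 0, 1, 2, 0, 2])

# only the class-2 samples agree by coincidence, so this reports ~0.29
print(np.mean(y_true == y_predicted))  # 0.2857...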
The Challenge
Whilst preserving the outlier class of -1, how can I check that the model is sufficiently accurate, despite the fact that its generated cluster indices do not align with the original labels? I understand that clustering algorithms are not normally evaluated this way in practice, but this is to verify in my automated CI/CD tests that my implementation is correct.
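For reference, here is a minimal sketch of the kind of check I have in mind, assuming scipy is available: it matches each predicted cluster to a true label with the Hungarian algorithm (scipy.optimize.linear_sum_assignment) while leaving the -1 outlier class unmapped. The clustering_accuracy name and the contingency-matrix construction are just illustrative, and I haven't verified that this handles every -1 edge case.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    # Best-match accuracy: map each predicted cluster index to a true label
    # via the Hungarian algorithm; the -1 outlier class is never remapped.
    mask = y_pred != -1
    true_ids = np.unique(y_true[mask])
    pred_ids = np.unique(y_pred[mask])

    # contingency[i, j] = how many samples have true label i and predicted cluster j
    contingency = np.zeros((len(true_ids), len(pred_ids)), dtype=int)
    for i, t in enumerate(true_ids):
        for j, p in enumerate(pred_ids):
            contingency[i, j] = np.sum((y_true == t) & (y_pred == p))

    # maximise total agreement by minimising the negated counts
    rows, cols = linear_sum_assignment(-contingency)
    mapping = {pred_ids[c]: true_ids[r] for r, c in zip(rows, cols)}

    # unmapped clusters (including -1) stay as -1, so true outliers still count
    remapped = np.array([mapping.get(p, -1) for p in y_pred])
    return np.mean(remapped == y_true)

On the example above this returns 1.0, and in the CI/CD test I would then assert that the value exceeds some threshold, e.g. assert clustering_accuracy(y_true, y_predicted) > 0.95. Is this a reasonable approach, or is there a more standard way?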