
Background

I'm implementing the DBSCAN algorithm. I have run it on a small dataset of randomly generated clusters, and want to get a single decimal value for how accurately it clusters the groups.

Motivation

This is for some simple unit testing that checks it can cluster basic, well-separated classes (i.e. checking in my CI/CD that the accuracy is appropriately high).

The Problem

The problem is that its outputs are its own cluster indexes (the classes it has discovered/chosen). These do not align with the original integer labels I generated earlier.

An Example to Clear Things Up

Say, I have some original labels:

y_true = np.array([0, 1, 1, 0, 2, 1, 2])

and I have some cluster-predicted outputs:

y_predicted = np.array([1, 0, 0, 1, 2, 0, 2])

You can see it has grouped the values correctly; however, the cluster indexes don't match the values in the original y_true array. Therefore we can't use the normal accuracy function of np.mean(y_true == y_predicted).
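To make the mismatch concrete, here is a small sketch using the two arrays above. The co-membership comparison is only an illustration I've added (it is not part of my implementation): it checks whether every pair of points is grouped together in one labelling exactly when it is grouped together in the other.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 2, 1, 2])
y_predicted = np.array([1, 0, 0, 1, 2, 0, 2])

# Naive accuracy compares label values position by position,
# so a pure renaming of the clusters is counted as an error.
naive_accuracy = float(np.mean(y_true == y_predicted))

# Co-membership matrices: entry (i, j) is True when points i and j
# carry the same label. These do not depend on the label names.
same_true = y_true[:, None] == y_true[None, :]
same_pred = y_predicted[:, None] == y_predicted[None, :]
same_grouping = bool(np.array_equal(same_true, same_pred))

print(naive_accuracy)  # 2 of 7 positions happen to match
print(same_grouping)   # the groupings themselves are identical
```

So the naive score is low even though the two labellings describe exactly the same grouping.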

The Challenge

Whilst preserving the outlier class of -1, how can I check that the model is sufficiently accurate, despite the fact that its generated classes do not align with the original labels? I understand that clustering algorithms are not used this way in practice, but this is for testing that my implementation is correct in my automated CI/CD testing.

  • How about sorting `y_true` and `y_pred` and taking the average of the absolute difference between `y_true[i]` and `y_pred[i]`, assuming the clusters have similar sizes, i.e. `len(y_true) == len(y_pred)`? – shaik moeed Aug 07 '23 at 19:13
  • Thank you, I'll try that now! – SamTheProgrammer Aug 07 '23 at 19:15
  • However, given that the clusters often have similar sizes, could this be a problem? @shaikmoeed The numpy code I've used is `np.mean(np.sort(y_true) == np.sort(y_prediction))` to give a value from 0 to 1 – SamTheProgrammer Aug 07 '23 at 19:15
  • I have found that this doesn't quite work. With `y1 = [1, 0, 2, 0, 1]; y2 = [2, 1, 0, 1, 2]` the value ends up as **0.6** rather than **1**, as it should be here. – SamTheProgrammer Aug 07 '23 at 19:24
  • Yeah, that's true. Your requirement seems to be close to [`adjusted_rand_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn-metrics-adjusted-rand-score). – shaik moeed Aug 07 '23 at 19:35
  • It's an appropriate metric, however I can't find many mathematical explanations of it. Do you know of any good resources? Also, is there any way to get an accuracy metric from 0 to 1? – SamTheProgrammer Aug 08 '23 at 14:38
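For reference, the sketch below reproduces the 0.6 result from the sorted comparison discussed in the comments, next to one possible way of getting a 0-to-1 accuracy that ignores label names. The helper `best_match_accuracy` is a hypothetical name introduced here, not a library function. It assumes a one-to-one mapping between predicted and true clusters, leaves the outlier label -1 untouched, and tries every mapping by brute force, so it is only practical for the handful of clusters a unit test would use.

```python
from itertools import permutations

import numpy as np


def best_match_accuracy(y_true, y_pred, outlier=-1):
    """Hypothetical helper: best accuracy over all one-to-one relabelings
    of the predicted cluster ids. The outlier label is never remapped."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_ids = [int(c) for c in np.unique(y_true) if c != outlier]
    pred_ids = [int(c) for c in np.unique(y_pred) if c != outlier]

    # If there are more predicted clusters than true ones, pad the targets
    # with placeholder ids that occur nowhere in y_true (they never match).
    n_extra = max(0, len(pred_ids) - len(true_ids))
    base = min(int(y_true.min()), outlier) - 1
    targets = true_ids + [base - i for i in range(n_extra)]

    best = 0.0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        mapping[outlier] = outlier  # outliers stay outliers
        remapped = np.array([mapping[int(p)] for p in y_pred])
        best = max(best, float(np.mean(remapped == y_true)))
    return best


y1 = np.array([1, 0, 2, 0, 1])
y2 = np.array([2, 1, 0, 1, 2])

# The sorted comparison from the comments compares label counts,
# not groupings, so it is fooled by the renaming.
sorted_score = float(np.mean(np.sort(y1) == np.sort(y2)))
matched_score = best_match_accuracy(y1, y2)

print(sorted_score)   # 0.6
print(matched_score)  # 1.0
```

In a test this could be used as something like `assert best_match_accuracy(y_true, y_predicted) > 0.95`, where the threshold is an arbitrary example value. The brute-force search grows factorially with the number of clusters, so it is a testing aid rather than a general metric.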
