What clustering algorithms work best for datasets with only binary categorical features?

Question

I have a dataset with a lot of binary categorical features and a single continuous target value. I would like to cluster them, but I am not quite sure what to use.

In the past, I have used DBSCAN for something similar and it worked well, but that dataset also had lots of continuous features.

Do you have any tips and suggestions?

Would you suggest matrix factorization and then cluster?

By "binary categorical (one-hot encoded)", do you mean that you have $n$ features, and, for each input, only one of them is $1$ and all the others are $0$? — nbro, May 15 '22 at 15:59
@nbro thanks for the comment! No, I have n features and any combination of 0, 1s among them is possible. i.e. if n = 3, I may have, 000 or 010, or 111, etc. — user199590, May 15 '22 at 16:02
Ok, then I don't think that's called one-hot encoded. 1-hot encoded is what I described, I think. — nbro, May 15 '22 at 16:07
@nbro ah, apologies, I meant to convey that the individual features can only be either 0 or 1. I'll edit. — user199590, May 15 '22 at 16:08

score 1 · Accepted Answer · answered May 16 '22 at 08:06

Any clustering algorithm that lets you choose the distance should work; the main issue is the similarity or distance metric that determines how similar (or different) two elements are. This is often something like Euclidean distance, but that won't work well with binary data, because it counts a shared 0 as much as a shared 1.
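To make the problem with Euclidean distance concrete, here is a small sketch. The vectors and the helper names `euclidean` and `shared_ones` are made up for illustration, and it assumes that a 0 means "feature absent", so two rows both having a 0 says little about their similarity.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length 0/1 vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shared_ones(a, b):
    """Number of positions where both vectors have a 1."""
    return sum(x & y for x, y in zip(a, b))


# Pair 1: the two rows have no 1s in common.
p, q = [1, 1, 0, 0, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0, 0, 0]
# Pair 2: the two rows have four 1s in common.
r, s = [1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 0, 0, 1, 1]

print(euclidean(p, q), shared_ones(p, q))  # 2.0 0
print(euclidean(r, s), shared_ones(r, s))  # 2.0 4
```

Both pairs differ in four positions, so Euclidean distance rates them as equally far apart, even though the second pair overlaps heavily and the first not at all.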

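I would suggest using the Jaccard index or the Dice coefficient. Both are similarity measures that ignore positions where both rows are 0, so take 1 minus the similarity to get a distance that is suitable for clustering such data.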
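As a rough sketch of how this could look in practice, the snippet below clusters a toy binary matrix with SciPy, using Jaccard distance and average-linkage hierarchical clustering. The function name `cluster_binary`, the toy data, the linkage method and the number of clusters are all assumptions chosen for illustration; swapping `metric="jaccard"` for `metric="dice"` gives the Dice variant.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_binary(X, n_clusters):
    """Average-linkage hierarchical clustering on Jaccard distances."""
    X = np.asarray(X, dtype=bool)
    # Condensed vector of pairwise Jaccard distances (1 - Jaccard index).
    distances = pdist(X, metric="jaccard")
    tree = linkage(distances, method="average")
    # Cut the tree into at most n_clusters flat clusters; labels start at 1.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy data: two groups of rows that share most of their 1s within a group
# and none across groups.
X = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
])
labels = cluster_binary(X, n_clusters=2)
print(labels)
```

The first three rows end up in one cluster and the last three in the other. Since the question mentions DBSCAN, note that scikit-learn's `DBSCAN` also accepts `metric="jaccard"`, so the same distance can be used there.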

Oliver Mason

Hey Oliver, thanks a lot! I'll try and experiment with using these distance metrics. – user199590 May 17 '22 at 05:47
