
I have a dataset with a lot of binary categorical features and a single continuous target value. I would like to cluster the samples, but I am not sure which algorithm to use.

In the past, I have used DBSCAN for something similar and it worked well, but that dataset also had lots of continuous features.

Do you have any tips and suggestions?

Would you suggest applying matrix factorization first and then clustering?

Oliver Mason
user199590
  • By "binary categorical (one-hot encoded)", do you mean that you have $n$ features, and, for each input, only one of them is $1$ and all the others are $0$? – nbro May 15 '22 at 15:59
  • @nbro thanks for the comment! No, I have n features and any combination of 0, 1s among them is possible. i.e. if n = 3, I may have, 000 or 010, or 111, etc. – user199590 May 15 '22 at 16:02
  • Ok, then I don't think that's called one-hot encoded. 1-hot encoded is what I described, I think. – nbro May 15 '22 at 16:07
  • @nbro ah, apologies, I meant to convey that the individual features can only be either 0 or 1. I'll edit. – user199590 May 15 '22 at 16:08

1 Answer


Any clustering algorithm that accepts a custom distance should work. The main issue is the similarity or distance metric that determines how similar (or different) two elements are. This is often something like Euclidean distance, but that won't work well with binary data.
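To illustrate one way Euclidean distance can mislead on 0/1 data, here is a small sketch. The vectors `a`, `b`, `c` and the helper names are made up for the example, and it assumes that a shared 1 matters more to you than a shared 0, which may not hold for every dataset.

```python
import numpy as np


def euclidean_distance(u, v):
    """Plain Euclidean distance between two 0/1 vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


def shared_ones(u, v):
    """Number of positions where both vectors are 1."""
    return int(np.sum((u == 1) & (v == 1)))


# Hypothetical samples with 10 binary features each.
a = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
b = np.array([0, 0, 1, 1, 0, 0, 0, 0, 0, 0])  # no active feature in common with a
c = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])  # contains both active features of a

print(euclidean_distance(a, b), shared_ones(a, b))  # 2.0 0
print(euclidean_distance(a, c), shared_ones(a, c))  # 2.0 2
```

Euclidean distance rates `b` and `c` as equally far from `a`, even though `c` overlaps with `a` and `b` does not.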

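I would suggest using the Jaccard index or the Dice coefficient. Both are similarity measures designed for binary vectors: they compare the features two elements share against the features that are active in either one. For clustering, use one minus the similarity as the distance between elements.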
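As a rough sketch of how this could look in practice, the snippet below computes Jaccard and Dice dissimilarities with SciPy's `pdist` and feeds the Jaccard distances into average-linkage hierarchical clustering. The toy matrix `X`, the choice of hierarchical clustering, and the target of two clusters are all assumptions made for illustration, not part of the original question.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: 6 samples, 6 binary features, two obvious groups.
X = np.array(
    [
        [1, 1, 1, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 1],
        [0, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 1, 1],
    ],
    dtype=bool,
)

# Condensed pairwise dissimilarities (1 - similarity) for both measures.
jaccard_dist = pdist(X, metric="jaccard")
dice_dist = pdist(X, metric="dice")

# Square form, useful for inspection or for algorithms that take a
# precomputed distance matrix.
jaccard_square = squareform(jaccard_dist)

# Average-linkage hierarchical clustering on the Jaccard distances,
# cut so that at most two clusters remain.
Z = linkage(jaccard_dist, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

If you would rather stay with DBSCAN, scikit-learn's `DBSCAN` accepts `metric="precomputed"`, so a square matrix such as `jaccard_square` can be passed to it directly; `eps` then has to be chosen on the 0 to 1 scale of the Jaccard distance.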

Oliver Mason