I'm trying to get a grasp on the scalability of clustering algorithms, and have a toy example in mind. Say I have around a million songs from $50$ genres. Each song has characteristics - some common across all or most genres, some shared by only a few genres, and some genre-specific.
Common attributes could be things like song duration, artist, label, year, album, key, etc. Genre-specific attributes could be things like lead guitarist, trombone player, conductor, or movie name (in the case of movie soundtracks). Assume that there are, say, $2000$ attributes across all possible genres.
The aim is to identify attributes that characterize subgenres of these genres. Of course, for rock, say, I could just collect the attributes of all rock songs, but even that set may be too broad: some attributes are specific to particular subgenres, so lumping them together at the genre level loses the granularity I'm after.
Note that for the purpose of this example, I'm not assuming that I know the subgenres a priori. That is, I'm not going to categorize songs into subgenres like post rock, folk rock, etc., and then pick out attributes characterizing them. I want to discover subgenres on the basis of clustering, if that makes sense.
In a nutshell: about a million songs belonging to $50$ genres, with $2000$ attributes across all songs collectively. So for each song I'll make a vector in $\mathbb{R}^{2000}$, where each dimension corresponds to an attribute. If that attribute is present for that song, the corresponding element of the vector is $1$, otherwise $0$ (e.g. a jazz-related attribute will be $0$ for a rock song). Now I want to do genre-wise clustering. For each genre, not only do I want to cluster the songs of that genre into groups, but I also want an idea of which attributes are most important for characterizing the individual groups.
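To make the encoding concrete, here's a minimal sketch of what I mean (the songs and attribute names are made up; Python/numpy):

```python
import numpy as np

# Hypothetical toy data: each song maps to the set of attributes it has.
songs = {
    "song_a": {"duration", "artist", "lead_guitarist"},
    "song_b": {"duration", "artist", "conductor"},
    "song_c": {"duration", "artist", "trombone_player"},
}

# Fixed ordering of all attributes (2000 of them in the real problem).
attributes = sorted({a for attrs in songs.values() for a in attrs})
attr_index = {a: j for j, a in enumerate(attributes)}

# One row per song, one column per attribute: 1 if present, else 0.
X = np.zeros((len(songs), len(attributes)), dtype=np.int8)
for i, (song, attrs) in enumerate(songs.items()):
    for a in attrs:
        X[i, attr_index[a]] = 1
```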
On the basis of this clustering, I can identify subgenres (of rock music, say), each characterized by a subset of the $2000$ attributes.
My first question is: is there a better way to set up the problem than forming $2000$-dimensional vectors of ones and zeros?
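One alternative I've considered is simply storing the same matrix sparsely, since each song has only a handful of the $2000$ possible attributes - continuing from the toy sketch above (so `songs` and `attr_index` are as defined there), with scipy as an assumption - but I don't know whether that addresses the representation question or just the memory cost:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same encoding as above, but stored sparsely: with ~1e6 songs and 2000
# attributes of which only a few are set per song, this avoids storing
# the zeros. rows/cols list only the nonzero (song, attribute) pairs.
rows, cols = [], []
for i, (song, attrs) in enumerate(songs.items()):
    for a in attrs:
        rows.append(i)
        cols.append(attr_index[a])
X_sparse = csr_matrix(
    ([1] * len(rows), (rows, cols)),
    shape=(len(songs), len(attributes)),
    dtype=np.int8,
)
```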
Secondly, given the vast number of dimensions and examples, what clustering methods could be tried? From what I've surveyed, there are graph-based, hierarchical, density-based, spectral, and other clustering methods. Which of these would suit the toy example best? I've heard that one can project the points onto a lower-dimensional subspace and then cluster, but I also want to know which attributes define the different clusters. Since the attributes are encoded in the dimensions, dimensionality reduction would lose that correspondence. So now what?
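To illustrate the kind of output I'm hoping for (not a proposed solution), here's a sketch using scikit-learn's MiniBatchKMeans on the binary matrix `X` from above; the choice of $k$ is arbitrary, and I'm aware k-means may be a poor fit for binary data:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Cluster the songs of one genre (in reality, X restricted to that
# genre's rows). k=2 is arbitrary; the number of subgenres is unknown.
km = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Because each dimension is still an attribute, the centroids are
# interpretable: a centroid coordinate is roughly the fraction of the
# cluster's songs having that attribute. The highest-weighted dimensions
# are then the attributes that characterize each cluster.
for c, centroid in enumerate(km.cluster_centers_):
    top = np.argsort(centroid)[::-1][:3]
    print(f"cluster {c}:", [attributes[j] for j in top])
```

The point of the sketch is really the last step: as long as the dimensions still correspond to attributes, the clusters are directly interpretable, which is exactly what seems to break once I project onto a lower-dimensional subspace.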