I'm trying to get a grasp on the scalability of clustering algorithms, and have a toy example in mind. Say I have around a million songs from $50$ genres. Each song has characteristics - some common across all or most genres, some shared by only a few genres, and some genre-specific.
Common attributes could be things like song duration, artist, label, year, album, key, etc. Genre-specific attributes could be things like lead guitarist, trombone player, conductor, or movie name (in the case of movie soundtracks). Assume that there are, say, $2000$ attributes across all possible genres.
The aim is to identify attributes that characterize subgenres of these genres. Of course, for rock, say, I could just collect the attributes of all rock songs, but even that set may be too broad: some attributes are specific to particular subgenres, so lumping them together at the genre level loses the granularity I'm after.
Note that for the purpose of this example, I'm not assuming that I know the subgenres a priori. That is, I'm not going to categorize songs into subgenres like post rock, folk rock, etc., and then pick out attributes characterizing them. I want to discover subgenres on the basis of clustering, if that makes sense.
In a nutshell: about a million songs belonging to $50$ genres, with $2000$ attributes across all songs collectively. So for each song I'll make a vector in $\mathbb{R}^{2000}$, where each dimension corresponds to an attribute. If that attribute is present for that song, the corresponding element of the vector is $1$, otherwise $0$ (e.g. a jazz-related attribute will be $0$ for a rock song). Now I want to do genre-wise clustering. For each genre, not only do I want to cluster the songs of that genre into groups, but I also want an idea of which attributes are most important for characterizing the individual groups.
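To make the encoding concrete, here's a minimal sketch of what I mean (the songs and attribute names are made up; Python/numpy):

```python
import numpy as np

# Hypothetical toy data: each song maps to the set of attributes it has.
songs = {
    "song_a": {"duration", "artist", "lead_guitarist"},
    "song_b": {"duration", "artist", "conductor"},
    "song_c": {"duration", "artist", "trombone_player"},
}

# Fixed ordering of all attributes (2000 of them in the real problem).
attributes = sorted({a for attrs in songs.values() for a in attrs})
attr_index = {a: j for j, a in enumerate(attributes)}

# One row per song, one column per attribute: 1 if present, else 0.
X = np.zeros((len(songs), len(attributes)), dtype=np.int8)
for i, (song, attrs) in enumerate(songs.items()):
    for a in attrs:
        X[i, attr_index[a]] = 1
```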
On the basis of this clustering, I can identify subgenres (of rock music, say), each characterized by a subset of the $2000$ attributes.
My first question is: is there a better way to set up the problem than forming $2000$-dimensional vectors of ones and zeros?
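One alternative I've considered is simply storing the same matrix sparsely, since each song has only a handful of the $2000$ possible attributes - continuing from the toy sketch above (so `songs` and `attr_index` are as defined there), with scipy as an assumption - but I don't know whether that addresses the representation question or just the memory cost:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same encoding as above, but stored sparsely: with ~1e6 songs and 2000
# attributes of which only a few are set per song, this avoids storing
# the zeros. rows/cols list only the nonzero (song, attribute) pairs.
rows, cols = [], []
for i, (song, attrs) in enumerate(songs.items()):
    for a in attrs:
        rows.append(i)
        cols.append(attr_index[a])
X_sparse = csr_matrix(
    ([1] * len(rows), (rows, cols)),
    shape=(len(songs), len(attributes)),
    dtype=np.int8,
)
```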
Secondly, given the vast number of dimensions and examples, what clustering methods could be tried? From what I've surveyed, there are graph-based, hierarchical, density-based, spectral, and other clustering methods. Which of these would suit the toy example best? I've heard that one can project the points onto a lower-dimensional subspace and then cluster, but I also want to know which attributes define the different clusters. Since the attributes are encoded in the dimensions, dimensionality reduction would lose that correspondence. So now what?
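To illustrate the kind of output I'm hoping for (not a proposed solution), here's a sketch using scikit-learn's MiniBatchKMeans on the binary matrix `X` from above; the choice of $k$ is arbitrary, and I'm aware k-means may be a poor fit for binary data:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Cluster the songs of one genre (in reality, X restricted to that
# genre's rows). k=2 is arbitrary; the number of subgenres is unknown.
km = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Because each dimension is still an attribute, the centroids are
# interpretable: a centroid coordinate is roughly the fraction of the
# cluster's songs having that attribute. The highest-weighted dimensions
# are then the attributes that characterize each cluster.
for c, centroid in enumerate(km.cluster_centers_):
    top = np.argsort(centroid)[::-1][:3]
    print(f"cluster {c}:", [attributes[j] for j in top])
```

The point of the sketch is really the last step: as long as the dimensions still correspond to attributes, the clusters are directly interpretable, which is exactly what seems to break once I project onto a lower-dimensional subspace.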