What techniques to explore for dynamic clustering of documents (emails)?

Question

I have a dataset of unlabelled emails that fall into distinct categories (around a dozen). I want to classify them, along with new ones that arrive in the future, in a dynamic manner. I know that there are dynamic clustering techniques that allow the clusters to evolve over time ('dynamic-means' being one of them). However, I would also like to be able to start with a predefined set of classes (or clusters/centroids), as I know for a fact what the types of those emails will be.

Furthermore, I need some guidance in terms of what vectorisation technique to use for my type of data. Would creating a term matrix using TF-IDF be sufficient? I assume that the data I am dealing with could be differentiated on the basis of keyword occurrence, but I cannot tell to what degree. Are there more sophisticated vectorisation techniques based more on the text semantics? Are they worth exploring?

`that allow the clusters to evolve over time` Just to confirm, does that mean that you want to add more categories in the future? — cantordust, Jul 03 '18 at 23:46
Thanks for your response. As I said in the comment below, I have a rough idea what the types of emails will be and want to cluster them around that idea. First, I believe I should not start the clustering procedure with a random set of centroids, but rather with predefined ones. Second, since I can only define my starting points for the algorithm based on the current dataset, I think it would be reasonable to let the centroids move around as new datapoints arrive. I'd like to stick with a fixed number of clusters for now. — Jan Parzydło, Aug 22 '18 at 14:14

score 1 · Answer 1 · answered Aug 08 '18 at 14:41

It sounds like you are trying to do some kind of semi-supervised learning. In semi-supervised learning, some data points are labelled (you know which class they belong to), and others are not. There are classification algorithms designed specifically for this kind of problem, such as the transductive SVM. I personally have not found these techniques to be more effective than simply discarding the unlabelled data and treating my problem as purely supervised, but YMMV.

TF-IDF remains fairly popular, as do n-gram-based approaches. A more modern vectorization to consider might be word2vec, which maps the words of a bag-of-words style representation into a dense feature space in which semantically similar words end up close together.
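
As a rough illustration of what a TF-IDF term matrix contains, here is a minimal sketch that uses only the Python standard library. The three toy emails and the helper names `tfidf_vectors` and `cosine` are made up for this example, and the smoothed IDF formula is one common variant among several. In practice a library implementation such as scikit-learn's `TfidfVectorizer` would be the more likely choice.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (sorted vocabulary, list of L2-normalised TF-IDF vectors as dicts)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF, so a term that occurs in every document keeps a positive weight.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vec = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return sorted(df), vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors (dicts)."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


# Hypothetical toy emails, already reduced to plain keyword strings.
emails = [
    "invoice payment due",
    "payment overdue invoice reminder",
    "team lunch friday",
]
vocabulary, vecs = tfidf_vectors(emails)
print(cosine(vecs[0], vecs[1]))  # the two billing emails share keywords
print(cosine(vecs[0], vecs[2]))  # no shared keywords
```

If keyword overlap alone separates the categories well in a check like this, TF-IDF may be enough; if emails of the same type use different wording, that is where embedding-based features such as word2vec become worth trying.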

John Doucette


Thanks. Actually, in my case all the data is unlabelled, hence the term 'clustering'. What I want to do is separate them into a predefined number of categories. There are a number of factors that I do not know how to handle: 1. how to ensure a good starting position for the clusters; 2. how to enable refinement of the clusters' definitions over time. As for the vectorisation techniques, I will definitely look into word2vec. – Jan Parzydło Aug 22 '18 at 14:06
@JanParzydło The standard clustering technique is K-Means clustering (https://en.wikipedia.org/wiki/K-means_clustering). This may be slow if you have high-dimensional data, but you can use k-medoids (https://en.wikipedia.org/wiki/K-medoids) with a pre-computed kernel instead if this is a problem. – John Doucette Aug 22 '18 at 15:41
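
The two points raised in the comments (starting from predefined centroids, and letting them drift as new emails arrive) can be combined in a sequential variant of K-Means. The following is a minimal sketch under stated assumptions, not a recommendation from the thread: the class name `SeededOnlineKMeans`, the 2-D toy points, and the choice to count each seed as one pseudo-observation are all illustrative. A library alternative would be scikit-learn's `MiniBatchKMeans`, which accepts an array of initial centres via `init` and supports incremental updates via `partial_fit`.

```python
def nearest(centroids, x):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((c - v) ** 2 for c, v in zip(centroids[j], x)),
    )


class SeededOnlineKMeans:
    """Sequential K-Means with a fixed number of clusters and hand-picked seeds."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        # Assumption: each seed counts as one pseudo-observation, so the first
        # real point moves its centroid halfway instead of replacing it.
        self.counts = [1] * len(self.centroids)

    def update(self, x):
        """Assign x to its nearest centroid, then move that centroid towards x."""
        j = nearest(self.centroids, x)
        self.counts[j] += 1
        rate = 1.0 / self.counts[j]  # running-mean step size
        self.centroids[j] = [c + rate * (v - c)
                             for c, v in zip(self.centroids[j], x)]
        return j


# Hypothetical 2-D feature vectors; real inputs would be TF-IDF or embedding vectors.
seeds = [[0.0, 0.0], [10.0, 10.0]]
model = SeededOnlineKMeans(seeds)
labels = [model.update(p) for p in [[1.0, 1.0], [9.0, 11.0], [0.0, 2.0]]]
print(labels)
print(model.centroids)
```

With a step size of `1 / count` each centroid is the running mean of its seed and the points assigned to it, so old emails never lose influence. Using a small constant step size instead would let the clusters track more recent emails, which may suit categories whose wording changes over time.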

score 0 · Answer 2 · answered Oct 08 '18 at 05:49

"I would also like to be able to start with a predefined set of classes (or clusters/centroids) as I know for a fact what the types of those emails will be."

This is not a clustering problem, but a semi-supervised learning problem. If you don't have labeled data yet, then create some labels. You might also want to look into "active learning".

One approach is:

1. For each category, create 5 labeled samples.
2. Train a classifier on them (e.g. tf-idf features and a small neural network).
3. Let the neural network label your dataset.
4. Check the labels where it was most confident for each class, and the ones where the probabilities were most evenly spread across the classes. Use this to quickly create more labels.
5. Maybe Amazon Mechanical Turk is an option to quickly generate more labels.
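
The "check the labels" step above amounts to ranking predictions by how peaked or how flat their class probabilities are. Here is a minimal sketch of that ranking, assuming the classifier returns one probability per class for each email. The probability rows are invented stand-ins for real classifier output, the helper names are hypothetical, and the ranking is done over the whole dataset rather than separately per class.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_review(predicted_probs, k):
    """Return the indices of the k most confident and k most uncertain predictions."""
    order = sorted(range(len(predicted_probs)),
                   key=lambda i: entropy(predicted_probs[i]))
    most_confident = order[:k]         # lowest entropy: quick to confirm
    most_uncertain = order[-k:][::-1]  # highest entropy: most informative to label
    return most_confident, most_uncertain


# Invented class probabilities for four emails and three categories.
predicted_probs = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],
    [0.70, 0.20, 0.10],
    [0.05, 0.90, 0.05],
]
confident, uncertain = select_for_review(predicted_probs, k=1)
print(confident, uncertain)
```

Confirming the confident predictions grows the labeled set cheaply, while labeling the uncertain ones is the core idea behind the "active learning" mentioned above.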
