I have a dataset of unlabelled emails that fall into roughly a dozen distinct categories. I want to classify them, along with new emails that arrive in the future, in a dynamic manner. I know there are dynamic clustering techniques that allow the clusters to evolve over time ('dynamic-means' being one of them). However, I would also like to start from a predefined set of classes (or clusters/centroids), since I already know what types of emails to expect.
Furthermore, I need some guidance on which vectorisation technique to use for this kind of data. Would a term matrix built with TF-IDF be sufficient? I assume the emails can be differentiated by keyword occurrence, but I cannot tell to what degree. Are there more sophisticated vectorisation techniques that capture the semantics of the text, and are they worth exploring?
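To make the setup concrete, here is a minimal sketch of what I have in mind: TF-IDF vectors, one predefined centroid per known class, and assignment of each email to the most similar centroid. The class names, seed keywords, sample emails, and helper functions (`fit_idf`, `tfidf`, `classify`) are all made up for illustration, and the sketch only covers seeding and assignment, not letting the clusters evolve over time.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only (a deliberately crude tokenizer).
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(docs):
    # Smoothed inverse document frequency, so every weight is strictly positive.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(text, idf):
    # Sparse, unit-length TF-IDF vector; terms unseen during fit_idf are ignored.
    tf = Counter(tokenize(text))
    vec = {t: c * idf[t] for t, c in tf.items() if t in idf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(v * b.get(t, 0.0) for t, v in a.items())


# Hypothetical seed keywords describing the classes known in advance.
seed_texts = {
    "billing": "invoice payment overdue amount due refund",
    "support": "error crash bug login password reset",
}

# Hypothetical unlabelled emails.
emails = [
    "Your invoice is overdue, please send the payment",
    "I get an error when I try to reset my password",
]

idf = fit_idf(list(seed_texts.values()) + emails)
centroids = {label: tfidf(text, idf) for label, text in seed_texts.items()}


def classify(text):
    # Assign the email to the predefined centroid with the highest similarity.
    vec = tfidf(text, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


labels = [classify(email) for email in emails]
print(labels)
```

This relies purely on keyword overlap with the seed descriptions, which is exactly the part I am unsure about for my data.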