How to use cosine similarity for classification when you have, say, 10000 samples belonging to two classes?

Question

Does anyone have experience with using Cosine Similarity for text classification? I see a number of articles on how to find cosine similarity between documents using Doc2Vec, Gensim, etc.

I have a binary classification problem where I want to try out cosine similarity. I know how to calculate it, but all the articles I have seen only explain it up to the point of calculating it between two documents.

Right now, I am planning to do this:

1. Calculate the cosine similarity of 'my paragraph' (the one that I want to classify) with all samples in class i (their class is known). Then take the average (call that avg_i).
2. Calculate the cosine similarity of 'my paragraph' with all samples in class o (their class is known). Then take the average (call that avg_o).
3. Compare avg_i and avg_o, and predict the class with the higher average for 'my paragraph'.

That sounds like a very manual way of doing it. Is there some better/widely used way of doing it?
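
For concreteness, the averaging procedure in the steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `vectorize` helper is a hypothetical bag-of-words stand-in (in practice it would be replaced by TF-IDF or Doc2Vec vectors), the toy samples are made up, and the class labels `"i"` and `"o"` simply mirror the names used in the steps.

```python
import math
from collections import Counter


def vectorize(text):
    """Hypothetical stand-in vectorizer: lowercase whitespace token counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(paragraph, samples_by_class):
    """Return the label whose samples have the highest average similarity."""
    query = vectorize(paragraph)
    averages = {}
    for label, samples in samples_by_class.items():
        sims = [cosine_similarity(query, vectorize(s)) for s in samples]
        averages[label] = sum(sims) / len(sims)
    return max(averages, key=averages.get), averages


# Made-up toy data; a real dataset would have thousands of samples per class.
samples_by_class = {
    "i": ["the cat sat on the mat", "a cat chased the mouse"],
    "o": ["stocks fell sharply today", "the market rallied on earnings"],
}

label, averages = classify("my cat likes the mat", samples_by_class)
print(label, averages)
```

One observation on the design: if every vector is first scaled to unit length, the average of the cosine similarities to a class equals the dot product of the query with the mean of that class's normalized vectors. So each class can be summarized once by a single mean vector instead of comparing against all 10000 samples at prediction time, which makes this approach a close relative of a nearest-centroid classifier.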

@Sanny28 It seems that you want to use the cosine similarity to build a model that decides which class your paragraph belongs to? Why do you want to use this? Why don't you want to train a model that takes as input the paragraphs and produces the class (if you have a labelled dataset)? It's also not clear what your problem really is. Are you asking how to efficiently compute the cosine similarity when you have too many paragraphs? — nbro, Jul 24 '21 at 12:45
Yes, what I want is a model that takes a paragraph and produces a class. I have done that successfully using other algorithms like RF, logistic regression and ANN. Someone suggested I try out cosine similarity too. I tried to find sample code on the internet, but all the articles that I see only explain how to calculate cosine similarity between two paragraphs. No one seems to explain how to use it to do classification (that is, compare a paragraph with a set of already-classified paragraphs in both classes). That is why I thought of the average method I explained in the question. — Sanny28, Jul 25 '21 at 13:21
