I am trying to build a film review classifier in Python that determines whether a given review is positive or negative. I'm trying to avoid ML libraries so that I can better understand the processes. Here is my approach and the problems that I am facing:
- I mine thousands of film reviews as training sets and classify them as positive or negative.
- I parse through my training set and for each class, I build an array of unique words.
- For each document, I build a vector of TF-IDF values where the vector size is my number of unique words.
- I use a Gaussian classifier to determine: $$P(C_i|w)\propto P(C_i)P(w|C_i)=P(C_i)\dfrac{1}{(2\pi)^{d/2}|\sigma_i|^{1/2}}e^{-\frac{1}{2}(w-\mu_i)^T\sigma_i^{-1}(w-\mu_i)}$$ where $w$ is my document as a vector, $d$ is its dimension (the number of unique words), $C_i$ is a particular class, $\mu_i$ is the mean vector and $\sigma_i$ is the covariance matrix of class $C_i$.
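To make the vocabulary and TF-IDF steps above concrete, here is a minimal sketch using only the standard library. The function names (`tokenize`, `build_vocabulary`, `tfidf_vectors`) are hypothetical, the tokenizer is a naive whitespace split, and the weighting assumes the common definition of term frequency times $\log(N/\text{df})$; it is not necessarily the exact variant in use here.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer (an assumption); a real one would also
    # strip punctuation and filter stop words.
    return text.lower().split()

def build_vocabulary(docs):
    # Sorted list of the unique words seen across all documents.
    return sorted({word for doc in docs for word in tokenize(doc)})

def tfidf_vectors(docs, vocab):
    # Assumes vocab was built from these same docs, so every word in it
    # appears in at least one document and df[w] is never zero.
    n_docs = len(docs)
    tokenized = [tokenize(doc) for doc in docs]

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # One entry per vocabulary word, so each vector has len(vocab) entries.
        vectors.append(
            [counts[w] / total * idf[w] if total else 0.0 for w in vocab]
        )
    return vectors

docs = ["good film", "bad film"]
vocab = build_vocabulary(docs)
vectors = tfidf_vectors(docs, vocab)
```

Every vector has one entry per vocabulary word, which is why the dimensionality grows with the number of unique words in the training set.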
This approach seems to make sense. My problem is that my algorithm is much too slow. As an example, I have sampled over 1,500 documents and found over 40,000 unique words. This means that each of my document vectors has 40,000 entries, and if I were to build a covariance matrix, it would have dimensions 40,000 by 40,000. Even if I were able to generate the entirety of $\sigma_i$, I would still have to invert it and compute the matrix product in the exponent, which would take an extraordinarily long time just to classify one document.
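A rough back-of-the-envelope calculation shows the scale of the problem. The sketch below assumes dense storage with 8-byte floats and counts only the multiplications in the quadratic form $(w-\mu_i)^T\sigma_i^{-1}(w-\mu_i)$, given an already inverted matrix; the variable names are illustrative.

```python
n_words = 40_000       # vocabulary size from the example above
bytes_per_float = 8    # assumption: dense float64 storage

# Number of entries in one 40,000 x 40,000 covariance matrix.
entries = n_words ** 2

# Memory needed to hold that matrix, in gigabytes (per class).
memory_gb = entries * bytes_per_float / 1e9

# Multiplications for one quadratic form: a matrix-vector product (n^2)
# followed by a dot product (n).
mults_per_document = n_words ** 2 + n_words

print(f"{entries:,} entries, {memory_gb:.1f} GB, "
      f"{mults_per_document:,} multiplications per document per class")
```

That is 1.6 billion entries and about 12.8 GB per class before any inversion is attempted. Note also that a sample covariance estimated from 1,500 documents has rank at most 1,499, so a 40,000 by 40,000 estimate would be singular and could not be inverted without some form of regularisation.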
I have experimented with a multinomial approach, which is working well, but I am still curious about how to make the Gaussian approach work more efficiently. I realise the matrix multiplication runtime itself can't be improved, so I was hoping for insight on how others are able to do this.
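For reference, a multinomial approach of the kind mentioned above can be sketched roughly as follows. This assumes multinomial naive Bayes with add-one (Laplace) smoothing on raw word counts; the function names and the `alpha` parameter are hypothetical, and the details may differ from the implementation actually tried.

```python
import math
from collections import Counter

def train_multinomial(docs, labels, alpha=1.0):
    # Vocabulary over the whole training set (naive whitespace tokens).
    vocab = {w for doc in docs for w in doc.lower().split()}
    log_prior = {}
    log_likelihood = {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        # Word counts pooled over all documents of this class.
        counts = Counter(w for d in class_docs for w in d.lower().split())
        # Smoothed denominator: total count plus alpha per vocabulary word.
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((counts[w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def classify(text, log_prior, log_likelihood):
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in text.lower().split():
            # Words never seen in training are skipped (an assumption).
            if w in log_likelihood[c]:
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)

train_docs = ["great fun film", "great acting", "boring film", "dull boring plot"]
train_labels = ["pos", "pos", "neg", "neg"]
log_prior, log_likelihood = train_multinomial(train_docs, train_labels)
print(classify("great film", log_prior, log_likelihood))
```

Classifying a document here costs one dictionary lookup per word per class, so the work grows with the document length rather than with the square of the vocabulary size.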
Some things I have tried:
- Filtered out stop words (but this still leaves me with tens of thousands of words)
- Estimated $\sigma_i$ by summing over a couple of documents.