3

Assume that I have a DataFrame with a text column. Problem: classification / prediction.

    sms_text
0   Go until jurong point, crazy.. Available only ...
1   Ok lar... Joking wif u oni...
2   Free entry in 2 a wkly comp to win FA Cup fina...
3   U dun say so early hor... U c already then say...
4   Nah I don't think he goes to usf, he lives aro...

After preprocessing the text: [WordCloud of the most frequent words]

From the above WordCloud, we can see the most frequent words, such as:

Free
Call
Text
Txt

As these are the most frequent words, my opinion is that they add little importance to prediction/classification, precisely because they appear so often.
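For reference, here is a rough sketch of how such a frequency count can be produced with scikit-learn's CountVectorizer (the file name is just a placeholder):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('sms_spam.csv')                       # placeholder file name
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(df['sms_text'])

# total count of each word across all messages, sorted descending
totals = pd.Series(counts.sum(axis=0).A1,
                   index=vectorizer.get_feature_names_out())
print(totals.sort_values(ascending=False).head(10))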

My question is: will removing the top (most frequent) words improve the model score?

How does this impact model performance?

Is it okay to remove the most frequent words?

Pluviophile
  • 1,223
  • 5
  • 17
  • 37

4 Answers

4

As far as I know, there are a few aspects that would probably improve the model score (see the sketch after this list):

  1. Normalization
  2. Lemmatization
  3. Stopwords removal (as you asked here)
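A minimal sketch of those three steps, assuming NLTK's English stopword list and WordNet lemmatizer (one possible setup, not the only one):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = re.sub(r'[^a-z\s]', ' ', text.lower())                  # 1. normalization
    tokens = [lemmatizer.lemmatize(t) for t in text.split()]       # 2. lemmatization
    return ' '.join(t for t in tokens if t not in stop_words)      # 3. stopword removal

print(preprocess("Free entry in 2 a wkly comp to win FA Cup final!"))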

To your question, "will removing the top frequent words (stopwords) improve the model score?": it depends on which stopwords you remove. If you do not remove stop words, words like I, my, me, etc. add noise to the dataset. Here is a comparison of those three aspects using an SVM classifier.

[Comparison of the three aspects using an SVM classifier]

You can see that without stopword removal, the train set accuracy decreased to 94.81% and the test set accuracy decreased to 88.02%. But you should be careful about which stopwords you remove.

If you are working with basic NLP techniques like BOW, Count Vectorizer or TF-IDF (Term Frequency–Inverse Document Frequency), then removing stopwords is a good idea because they act as noise for these methods. If you are working with LSTMs or other models that capture semantic meaning, where the meaning of a word depends on the preceding context, then it becomes important not to remove stopwords.
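For illustration, here is a rough sketch of such a comparison with a TF-IDF + SVM pipeline (not the exact experiment behind the numbers above; the file, column, and label names are placeholders):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

df = pd.read_csv('sms_spam.csv')                       # placeholder file/column names
X_train, X_test, y_train, y_test = train_test_split(
    df['sms_text'], df['label'], test_size=0.2, random_state=42)

for stop in (None, 'english'):
    model = make_pipeline(TfidfVectorizer(stop_words=stop), LinearSVC())
    model.fit(X_train, y_train)
    print(f"stop_words={stop!r}: "
          f"train={model.score(X_train, y_train):.4f}, "
          f"test={model.score(X_test, y_test):.4f}")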

So, what's the solution?

You may want to use the Python package nlppreprocess, which removes stop words that are not necessary. It also has some additional functionality that can make cleaning text faster. For example:

from nlppreprocess import NLP
import pandas as pd

nlp = NLP()                                   # configure the preprocessing pipeline
df = pd.read_csv('some_file.csv')
df['text'] = df['text'].apply(nlp.process)    # clean every message

Source:

  1. https://github.com/miguelfzafra/Latest-News-Classifier

  2. https://towardsdatascience.com/why-you-should-avoid-removing-stopwords-aa7a353d2a52

Coderio
  • 166
  • 4
4

Based on my experience, there are two tasks that proved to improve the accuracy/score of my model:

  1. Normalization
    • removing characters and symbols from the text
    • lowercase folding
  2. Stopwords removal (as you asked)

These processes helped improve my model, since stopwords add noise when you use word frequency counts to represent text.

So, to your question, does stopword removal improve the score? It depends on your model. If you are using word counts to represent text, removing stopwords reduces noise in text classification.
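As a small illustration of that noise (the example messages below are made up), the bag-of-words vocabulary shrinks noticeably once English stopwords are dropped:

from sklearn.feature_extraction.text import CountVectorizer

msgs = ["I am on my way, call me when you are there",
        "Free entry in a weekly comp, text WIN to claim"]

for stop in (None, 'english'):
    cv = CountVectorizer(stop_words=stop)
    cv.fit(msgs)
    print(f"stop_words={stop!r}: {len(cv.get_feature_names_out())} features ->",
          sorted(cv.get_feature_names_out()))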

1

The technical term for these words is "stop words". Have a look at Information Retrieval and indexing (e.g. TF-IDF) to make up your mind whether you want to remove them or not.
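One detail worth knowing before deciding: the IDF part of TF-IDF already down-weights words that occur in many documents. A tiny sketch with made-up counts, using the smoothed IDF formula from scikit-learn's TfidfVectorizer:

import numpy as np

n_docs = 1000                                            # made-up corpus size
for word, doc_freq in [("free", 400), ("jurong", 3)]:    # made-up document frequencies
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1      # smoothed IDF
    print(f"{word}: idf = {idf:.2f}")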

Oliver Mason
  • 5,322
  • 12
  • 32
1

Based on my project, here is how I clean and prepare the data:

  1. Delete specific characters ('\r', '\n', '"')
  2. Convert to lowercase
  3. Delete symbols
  4. Lemmatization (reduce words to their base form with WordNet)
  5. Delete stopwords

With these steps, I got some improvement in the accuracy score of my model.
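A rough sketch of these five steps (not the project's exact code; the regex and NLTK resources are assumptions):

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# requires nltk.download('stopwords') and nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean(text):
    text = text.replace('\r', ' ').replace('\n', ' ').replace('"', '')  # 1. specific characters
    text = text.lower()                                                  # 2. lowercase
    text = re.sub(r'[^a-z\s]', ' ', text)                                # 3. symbols
    tokens = [lemmatizer.lemmatize(t) for t in text.split()]             # 4. lemmatization (WordNet)
    return ' '.join(t for t in tokens if t not in stop_words)            # 5. stopwords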

My project: https://github.com/khaifagifari/NLP-Course-TelU