3

Assume that I have a DataFrame with a text column. Problem: classification / prediction.

    sms_text
0   Go until jurong point, crazy.. Available only ...
1   Ok lar... Joking wif u oni...
2   Free entry in 2 a wkly comp to win FA Cup fina...
3   U dun say so early hor... U c already then say...
4   Nah I don't think he goes to usf, he lives aro...

After preprocessing the text: [WordCloud of the most frequent words]

From the above WordCloud, we can see the most frequent words, such as:

Free
Call
Text
Txt

As these are the most frequent words, my opinion is that they add little importance to prediction/classification, precisely because they appear so often.
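For reference, here is a rough sketch of how such a frequency count can be produced with scikit-learn's CountVectorizer (the file name is just a placeholder):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('sms_spam.csv')                       # placeholder file name
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(df['sms_text'])

# total count of each word across all messages, sorted descending
totals = pd.Series(counts.sum(axis=0).A1,
                   index=vectorizer.get_feature_names_out())
print(totals.sort_values(ascending=False).head(10))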

My question is: will removing the top (most frequent) words improve the model score?

How does this impact model performance?

Is it okay to remove the most frequent words?

Pluviophile
  • 1,223
  • 5
  • 17
  • 37

4 Answers

4

As far as I know, there are a few aspects that would probably improve the model score (see the sketch after this list):

  1. Normalization
  2. Lemmatization
  3. Stopwords removal (as you asked here)
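A minimal sketch of those three steps, assuming NLTK's English stopword list and WordNet lemmatizer (one possible setup, not the only one):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = re.sub(r'[^a-z\s]', ' ', text.lower())                  # 1. normalization
    tokens = [lemmatizer.lemmatize(t) for t in text.split()]       # 2. lemmatization
    return ' '.join(t for t in tokens if t not in stop_words)      # 3. stopword removal

print(preprocess("Free entry in 2 a wkly comp to win FA Cup final!"))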

To your question, "will removing the top frequent words (stopwords) improve the model score?": it depends on which stopwords you remove. If you do not remove stop words, words like I, my, me, etc. add noise to the dataset. Here is a comparison of those three aspects using an SVM classifier.

[Comparison of the three aspects using an SVM classifier]

You can see that without stopword removal, the train set accuracy decreased to 94.81% and the test set accuracy decreased to 88.02%. But you should be careful about which stopwords you remove.

If you are working with basic NLP techniques like BOW, Count Vectorizer or TF-IDF (Term Frequency–Inverse Document Frequency), then removing stopwords is a good idea because they act as noise for these methods. If you are working with LSTMs or other models that capture semantic meaning, where the meaning of a word depends on the preceding context, then it becomes important not to remove stopwords.
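For illustration, here is a rough sketch of such a comparison with a TF-IDF + SVM pipeline (not the exact experiment behind the numbers above; the file, column, and label names are placeholders):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

df = pd.read_csv('sms_spam.csv')                       # placeholder file/column names
X_train, X_test, y_train, y_test = train_test_split(
    df['sms_text'], df['label'], test_size=0.2, random_state=42)

for stop in (None, 'english'):
    model = make_pipeline(TfidfVectorizer(stop_words=stop), LinearSVC())
    model.fit(X_train, y_train)
    print(f"stop_words={stop!r}: "
          f"train={model.score(X_train, y_train):.4f}, "
          f"test={model.score(X_test, y_test):.4f}")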

So, what's the solution?

You may want to use the Python package nlppreprocess, which removes stop words that are not necessary. It also has some additional functionality that can make cleaning text faster. For example:

from nlppreprocess import NLP
import pandas as pd

nlp = NLP()                                   # configure the preprocessing pipeline
df = pd.read_csv('some_file.csv')
df['text'] = df['text'].apply(nlp.process)    # clean every message

Source:

  1. https://github.com/miguelfzafra/Latest-News-Classifier

  2. https://towardsdatascience.com/why-you-should-avoid-removing-stopwords-aa7a353d2a52

Coderio
  • 166
  • 4
4

Based on my experience, there are two tasks that proved to improve the accuracy/score of my model:

  1. Normalization
    • removing characters and symbols from the text
    • lowercase folding
  2. Stopwords removal (as you asked)

These processes helped improve my model, since stopwords add noise when you use word frequency counts to represent text.

So, to your question, does stopword removal improve the score? It depends on your model. If you are using word counts to represent text, removing stopwords reduces noise in text classification.
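As a small illustration of that noise (the example messages below are made up), the bag-of-words vocabulary shrinks noticeably once English stopwords are dropped:

from sklearn.feature_extraction.text import CountVectorizer

msgs = ["I am on my way, call me when you are there",
        "Free entry in a weekly comp, text WIN to claim"]

for stop in (None, 'english'):
    cv = CountVectorizer(stop_words=stop)
    cv.fit(msgs)
    print(f"stop_words={stop!r}: {len(cv.get_feature_names_out())} features ->",
          sorted(cv.get_feature_names_out()))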

1

The technical term for these words is "stop words". Have a look at Information Retrieval and indexing (e.g. TF-IDF) to make up your mind whether you want to remove them or not.
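One detail worth knowing before deciding: the IDF part of TF-IDF already down-weights words that occur in many documents. A tiny sketch with made-up counts, using the smoothed IDF formula from scikit-learn's TfidfVectorizer:

import numpy as np

n_docs = 1000                                            # made-up corpus size
for word, doc_freq in [("free", 400), ("jurong", 3)]:    # made-up document frequencies
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1      # smoothed IDF
    print(f"{word}: idf = {idf:.2f}")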

Oliver Mason
  • 5,322
  • 12
  • 32
1

Based on my project, here is how I clean and prepare the data:

  1. Delete specific characters ('\r', '\n', '"')
  2. Convert to lowercase
  3. Delete symbols
  4. Lemmatization (reduce words to their base form with WordNet)
  5. Delete stopwords

With these steps, I got some improvement in the accuracy score of my model.
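A rough sketch of these five steps (not the project's exact code; the regex and NLTK resources are assumptions):

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# requires nltk.download('stopwords') and nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean(text):
    text = text.replace('\r', ' ').replace('\n', ' ').replace('"', '')  # 1. specific characters
    text = text.lower()                                                  # 2. lowercase
    text = re.sub(r'[^a-z\s]', ' ', text)                                # 3. symbols
    tokens = [lemmatizer.lemmatize(t) for t in text.split()]             # 4. lemmatization (WordNet)
    return ' '.join(t for t in tokens if t not in stop_words)            # 5. stopwords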

My project: https://github.com/khaifagifari/NLP-Course-TelU