I am currently working on a binary text classification model (false information detection) in Python, using a Support Vector Machine with TF-IDF as the text vectorizer. I have already trained the model, but upon testing I encountered a problem:

For example, the model predicted the entry "COVID-19 is happening today" as "True", but after changing the text to "COVID-19 is not happening today", it is still predicted as "True", when it should be predicted as "False".

Where does the problem lie in this situation?

How can we make the algorithm classify texts with opposite meanings, like the ones mentioned above?

Note:

  • The text that exists in the dataset I used in modelling is “COVID-19 is happening today.”

  • I also used predict_proba to get the probability of the text being 0 (False) or 1 (True). The two entries produce exactly the same predict_proba output, so I can say the model reads them as the same text (perhaps both as "COVID-19 is happening today").
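Here is a minimal sketch reproducing the symptom (an illustration, not my exact pipeline; sklearn's built-in English stop word list happens to include both "is" and "not"):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative only: with English stop words removed, "is" and "not"
# are both dropped, so the two sentences become identical token bags.
vectorizer = TfidfVectorizer(use_idf=True, stop_words="english")
vectorizer.fit(["COVID-19 is happening today"])

a = vectorizer.transform(["COVID-19 is happening today"])
b = vectorizer.transform(["COVID-19 is not happening today"])

print((a != b).nnz == 0)  # True: identical vectors, hence identical predict_proba
```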

  • Describe in detail the preprocessing you apply and how you generate the TF-IDF features from the text. I know it sounds stupid, but if you're using automatic preprocessing tools, they might exclude stop words, for example, among which are "is" and "not". That would turn those two sentences into the same one. – Edoardo Guerriero Mar 02 '22 at 11:27
  • Stop words may be a simple issue to fix, but that will only get you so far. It would be worth knowing how complex the statements are that you are classifying. – Neil Slater Mar 02 '22 at 12:04
  • In terms of data preprocessing, I just removed symbols, then performed tokenization, stop word removal, and stemming. Regarding what you said, I have also tried excluding negation words such as 'no', 'not', 'cant', 'dont', etc. from stop word removal, but the problem is still present. As for the TF-IDF, I used sklearn's TfidfVectorizer(use_idf=True) to vectorize the text. @EdoardoGuerriero – alexand88r Mar 02 '22 at 12:09
  • Such statements present in my dataset are "The COVID19 vaccine is dangerous because 23 people died in Norway within hours of receiving it." (False information) and "The Pfizer vaccine can be safely administered to children from 5 years of age. Both Moderna and Pfizer vaccines are licensed for use in children from 12 years of age." (True). Would it help, @NeilSlater, if I posted samples from the dataset I am using? – alexand88r Mar 02 '22 at 12:24
  • @alexand88r: I think if you used [edit] to add just those two examples, it would help give a sense of the problem you are trying to solve. Also add the size of your training dataset: you are going to get suggestions for more sophisticated sequence-based classifiers, but your options may be limited if you only have a few thousand examples or less. – Neil Slater Mar 02 '22 at 12:41

1 Answer


Going step by step:

Preprocessing

Preprocessing is a big deal in NLP. Out there you'll find many tutorials describing the classic steps, but few explanations of why and when you should actually perform them. Let's go through the steps you're performing:

  • remove symbols: usually symbols don't convey much semantic meaning, and in many cases we don't care about punctuation. Moreover, when using pretrained features like embeddings, the dictionary usually doesn't contain vectors for such characters, which is another reason we can remove them without losing any information. BUT, if you're extracting your own features and dealing with short sentences, these symbols might become an important source of information for a model. For example, in tweets # and @ are pretty important symbols, useful to distinguish rubbish (tweets with no text but only hashtags and mentions) from good stuff. For all these reasons you might want to consider skipping this step.
  • tokenization: in its most common form, tokenization simply means splitting text at the word level. Of course this is useful most of the time, but it's worth mentioning that you don't have to limit yourself to generating features only from single words, and you don't have to limit yourself to just text. For example, you might also generate numerical features to concatenate with the encoded text, such as counting how many times each token appears in the sentence. The logic behind this will become clear in the next steps.
  • stop words removal: this is what is causing your current issue. After stop word removal, many short sentences become literally the same, because stop words are extremely frequent words like "is", "are", "not", "the". The reasoning about why and when to use this step is basically the same as for symbol removal. Again, consider skipping this step (see the sketch after this list).
  • stemming: reducing words to their common root serves the purpose of reducing the total dictionary size. Most of the time you don't care about having separate features for a word and its plural, for example, and combining them helps build more robust features. So this is an OK step for your case too.
  • extra: many times, along with preprocessing, you also perform feature extraction. By this I don't mean the TF-IDF, which is only the last step that turns words into numbers, but rather extracting n-grams, applying simple sentiment analysis to get a positiveness score per word, and much more. Sentiment analysis in particular might suit your case, since factual sentences are most of the time neutral, while false statements most of the time convey opinions, usually expressed with more non-neutral words on the positive or negative extremes of the sentiment spectrum. Long story short: be creative and help your model. Artificial intelligence is stupid, and it works only if you guide it with good features.
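As a concrete illustration of the stop words and n-grams points, here is a minimal sketch (toy sentences, not your actual pipeline): keeping stop words and extracting word bigrams gives TF-IDF a feature for "not happening", so the two sentences no longer collapse into the same vector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["COVID-19 is happening today",
             "COVID-19 is not happening today"]

# stop_words=None keeps "is"/"not"; ngram_range=(1, 2) adds bigram features
# such as "not happening" on top of the single words.
vectorizer = TfidfVectorizer(use_idf=True, stop_words=None, ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

print(sorted(vectorizer.vocabulary_))  # includes 'not' and 'not happening'
print((X[0] != X[1]).nnz > 0)          # True: the two vectors now differ
```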

Training

Be sure not to use a linear SVM, otherwise you can be sure it will never perform well on this kind of task. What you're trying to learn here are really subtle hints that most likely require a lot of non-linear features. Consider also trying simple deep learning models like a multichannel CNN for text classification. They are usually much better than SVMs and faster to train. Admittedly you need a large enough dataset, but I myself have trained tweet classification models on small datasets of 1k instances with good performance. In this regard, leveraging pretrained features like GloVe embeddings will boost the final performance a lot, since those vectors were trained on billions of documents (hence, rather than training from scratch, you're fine-tuning a classifier on top of a pretrained language model).
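For the SVM part, a minimal sketch of what I mean (the toy sentences and labels below are placeholders for your own dataset; the point is the non-linear RBF kernel, plus probability=True so predict_proba still works):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy stand-ins for your real training data (0 = False info, 1 = True info).
texts = [
    "COVID-19 is happening today",
    "COVID-19 is not happening today",
    "The vaccine is safe for children",
    "The vaccine is not safe for children",
    "Masks reduce transmission",
    "Masks do not reduce transmission",
    "The outbreak is real today",
    "The outbreak is not real",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Non-linear SVM (RBF kernel) on unigram+bigram TF-IDF features. With a
# dataset this tiny the probability calibration is meaningless; this only
# shows the shape of the pipeline.
model = make_pipeline(
    TfidfVectorizer(stop_words=None, ngram_range=(1, 2)),
    SVC(kernel="rbf", probability=True),
)
model.fit(texts, labels)
print(model.predict_proba(["COVID-19 is not happening today"]))
```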

Edoardo Guerriero