
I'm facing a problem. I'm working on a mixed-data model with a NN (MLP & word embeddings). My results are not very good, and I observed that the class proportions in my data are correlated with my classification results. Let me explain:

Here's the distribution of my DATA

As you can see, I have more LIVB than the other classes. The problem is that the predictions of my model are always LIVB.

And I don't understand why. Is it high variance? Is it high bias? What methods should I use to diagnose the error in a classification problem? Should I add more features? Is my model wrong? Has anyone had this problem before?

Thanks for your help !

3 Answers


I see two main issues here:

  • you have very little data
  • you're using a generic MLP

What you observe is just overfitting: your multi-layer perceptron is simply learning to predict the majority class, because that's the class that leads to the lowest possible error when it's chosen all the time.
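A quick, hypothetical illustration of this effect (the class names and counts below are made up, not taken from the question): on an imbalanced label set, a constant majority-class predictor already achieves a low error rate, so a weak model has little incentive to move away from it.

```python
from collections import Counter

# Hypothetical imbalanced label set: LIVB dominates the other classes.
labels = ["LIVB"] * 80 + ["CHEM"] * 10 + ["GEOG"] * 5 + ["ORG"] * 5

# A constant classifier that always predicts the majority class...
majority_class, majority_count = Counter(labels).most_common(1)[0]
predictions = [majority_class] * len(labels)

# ...already reaches 80% accuracy without learning anything useful.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(majority_class, accuracy)  # LIVB 0.8
```

Any model that hasn't extracted a real signal from the features will gravitate toward exactly this solution.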

For sure more features will help, along with a different architecture (a CNN would be a start). But considering how few training instances you have, I wouldn't expect great results anyway.

To maximize your chances of training something that will also learn to predict the minority classes (which have a support of only 5), you should probably consider fine-tuning a pretrained model like BERT.

Edoardo Guerriero

Despite how software might work, neural networks do not return labels. Neural networks return probabilities of class membership (typically fairly poor ones, which is a topic for a separate question). If you make probability predictions instead of having your software tell you the most probable category, I expect you to find that you have more diversity in those predictions than “LIVB every time”.
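A small sketch of this point (the class names and scores are invented for illustration): even when the arg-max label is LIVB for every input, the underlying probabilities can differ substantially between examples.

```python
import math

classes = ["LIVB", "CHEM", "GEOG"]  # hypothetical class names

def softmax(logits):
    """Convert raw model scores into class-membership probabilities."""
    exps = [math.exp(z - max(logits)) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical inputs: LIVB wins both times, but with very
# different confidence, which the hard labels completely hide.
for logits in [(4.0, 0.5, 0.2), (1.1, 1.0, 0.9)]:
    probs = softmax(logits)
    label = classes[probs.index(max(probs))]
    print(label, [round(p, 2) for p in probs])
```

The first input is a confident LIVB; the second is nearly a three-way tie that arg-max collapses to the same label. Inspecting the probabilities instead of the labels reveals that difference.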

What’s happening is that LIVB is the most likely category going into the problem, and you need considerable evidence to shake your prior belief that LIVB is the most likely outcome. You are unable to produce enough evidence for another category to shake the model away from giving LIVB the highest probability. Thus, this seems to be a matter of bias: your model lacks the ability to strongly discriminate between categories and tends to fall back on its prior probability that LIVB is most likely.

Annoyingly, it might be that this is just how your problem works: LIVB might always be the most likely outcome.

Finally, I agree with other comments that there are too few observations for a neural network to have much of a shot of being useful. Neural networks are a great way to get a lot of discriminative ability in order to get the model to scream, “This is not LIVB!” However, you probably lack the data needed for a large network not to overfit.

Dave

In my understanding, there could be multiple problems here:

  1. Try checking the labels on the training data manually -- you may be surprised to find that the training data is mislabeled (i.e., all your training examples could be labeled LIVB). Do the same for your test data.

  2. You could try varying the hyperparameters of your MLP.

  3. You could try a context-based embedding, like a sentence encoder with a CNN, as pointed out by @edoardo-guerriero.

  4. Also, instead of an MLP, try a simpler algorithm like multi-class logistic regression. If logistic regression works well with your current word embeddings, then you know the issue is with the MLP (neural network) architecture.
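Point 4 could look something like the following sketch, using scikit-learn with a synthetic imbalanced dataset standing in for the real embedding features (the dataset sizes and class weights here are assumptions, not values from the question):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real word-embedding features:
# 300 samples, 50 features, 4 classes with a dominant majority class.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10,
    n_classes=4, weights=[0.7, 0.1, 0.1, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Multi-class logistic regression as a simple baseline;
# class_weight="balanced" counteracts the class imbalance.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)
print(sorted(set(clf.predict(X_test))))  # classes actually predicted
print(round(clf.score(X_test, y_test), 2))
```

If even this baseline predicts several classes on your real embeddings while the MLP predicts only LIVB, that points at the MLP architecture or training setup rather than the features.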

ML_Passion