10

I have two machine learning models (I use LSTMs) that produce different results on the validation set (~100 samples):

  • Model A: Accuracy: ~91%, Loss: ~0.01
  • Model B: Accuracy: ~83%, Loss: ~0.003

The size and the speed of both models are almost the same. So, which model should I choose?

nbro
malioboro
  • Interesting question, I have also faced the same situation... but then it turned out it was a local minimum when I did some hyperparameter adjustments – Feb 07 '19 at 07:37
  • The last question is not clear; please clarify. – ssegvic Feb 15 '19 at 10:31

3 Answers

6

You should choose model A. The loss is just a differentiable proxy for accuracy.

That said, the situation should be examined in more detail. If the higher loss is due to the data term, examine the data points that produce high loss and check for overfitting or incorrect labels.

If the higher loss is due to a regularizer, then reducing the regularization factor may further improve the results.
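To make the "proxy" point concrete, here is a toy sketch (illustrative numbers, nothing from the question) showing how accuracy and cross-entropy loss can rank two models differently: a single overconfident mistake can dominate the loss while costing almost no accuracy.

```python
# Toy sketch (illustrative numbers, not the OP's models): accuracy and
# cross-entropy loss need not rank two models the same way.
import numpy as np

def accuracy(y_true, p):
    return np.mean((p >= 0.5) == y_true)

def log_loss(y_true, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# "Model A": only one mistake, but it is made with near-total confidence,
# so that single point dominates the loss.
p_a = np.array([0.99, 0.99, 0.99, 0.99, 0.99, 0.01, 0.01, 0.01, 0.01, 0.999])

# "Model B": two mistakes, but every probability is modest, so the loss stays low.
p_b = np.array([0.8, 0.8, 0.8, 0.8, 0.45, 0.2, 0.2, 0.2, 0.2, 0.55])

for name, p in (("A", p_a), ("B", p_b)):
    print(name, "accuracy:", accuracy(y, p), "loss:", round(log_loss(y, p), 3))
```

Model A comes out more accurate but with the larger loss, which is exactly the kind of discrepancy discussed above; inspecting the individual high-loss points shows why.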

ssegvic
  • What do you mean precisely by "X is a _differentiable_ proxy for Y"? I think you should clarify this. – nbro Feb 23 '19 at 17:51
  • It means that a differentiable function X (e.g. the negative log loss) is used as a replacement (proxy) for Y (accuracy), which is unsuitable due to being non-differentiable. – ssegvic Feb 23 '19 at 20:26
  • But if there is a "one-to-one" correspondence between accuracy and loss, then why can you obtain the results the OP is describing? – nbro Feb 23 '19 at 20:32
  • A proxy is... just a proxy (an upper bound, to be more precise). Not the real thing. A single data point may result in a very high loss contribution while only mildly affecting accuracy. One may also have a high loss due to the regularization term even when the accuracy is 100%. The only way to find out the reason for the discrepancy is to have a look at the data. – ssegvic Feb 23 '19 at 21:08
4

You should note that both your results are consistent with a "true" probability of 87% accuracy, and your measurement of a difference between these models is not statistically significant. If an 87% accuracy is applied at random, there is approximately a 14% chance of getting the two extremes of accuracy you have observed purely by chance, provided samples are chosen randomly from the target population and the models are different enough to make errors effectively at random. This last assertion is usually not true though, so you can relax a little - that is, unless you took different random slices for cross-validation in each case.
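For reference, here is a rough sketch of the kind of simulation such an estimate could come from (an assumed setup, not necessarily the exact experiment behind the 14% figure): draw measured accuracies from a binomial distribution with a shared "true" accuracy of 87% on 100 samples and see how often results as extreme as yours appear by chance.

```python
# Rough sketch (assumed setup): how much does measured accuracy fluctuate on a
# 100-sample validation set if the "true" accuracy of both models is 87% and
# errors are effectively random?
import numpy as np

rng = np.random.default_rng(0)
n_samples, true_acc, n_trials = 100, 0.87, 100_000

acc_a = rng.binomial(n_samples, true_acc, n_trials) / n_samples
acc_b = rng.binomial(n_samples, true_acc, n_trials) / n_samples

print("P(measured accuracy >= 91%):", (acc_a >= 0.91).mean())
print("P(measured accuracy <= 83%):", (acc_a <= 0.83).mean())
print("P(gap of 8+ points between two such models):",
      (np.abs(acc_a - acc_b) >= 0.08).mean())
```

The exact probability depends on how the "extreme" event is defined, but either way the error bars on an accuracy estimated from 100 samples are several percentage points wide.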

100 test cases is not really enough to discern small differences between models. I would suggest using k-fold cross-validation in order to reduce errors in your accuracy and loss estimates.
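As an illustration, here is a minimal k-fold sketch (scikit-learn's StratifiedKFold is assumed; build_and_train and evaluate are hypothetical placeholders for your own LSTM training and evaluation code):

```python
# Minimal sketch (scikit-learn assumed; build_and_train/evaluate are
# hypothetical placeholders for the OP's own LSTM training and evaluation).
import numpy as np
from sklearn.model_selection import StratifiedKFold

def kfold_scores(X, y, build_and_train, evaluate, k=5, seed=42):
    """Return one score per fold so mean and spread can be compared per model."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = build_and_train(X[train_idx], y[train_idx])
        scores.append(evaluate(model, X[val_idx], y[val_idx]))
    return np.array(scores)

# Hypothetical usage: run both architectures with the same seed so they see
# exactly the same folds, then compare means and standard deviations.
# scores_a = kfold_scores(X, y, build_and_train_a, evaluate_accuracy)
# scores_b = kfold_scores(X, y, build_and_train_b, evaluate_accuracy)
```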

Also, it is critical to check that the cross-validation split was identical in both cases. If you used auto-splitting with a standard tool and did not set the appropriate RNG seed, then you may have got a different set each time, and your results may just be showing variance due to the validation split, which could completely swamp any differences between the models.
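For example, a small sketch of pinning the split (scikit-learn's train_test_split is assumed, and the data here is only a placeholder to make the snippet runnable):

```python
# Sketch only: fix the RNG seed so both models are evaluated on the very same
# 100 validation samples.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 50)            # placeholder features
y = np.random.randint(0, 2, size=1000)  # placeholder labels

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=100, random_state=42, stratify=y)

# Reuse X_val / y_val unchanged when evaluating both model A and model B.
```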

However, assuming the exact same dataset was used each time, and it was a representative sample of your target population, then on average you should expect the one with the best metric to have the highest chance of being the best model.

What you should really do is decide which metric to base the choice on in advance of the experiment. The metric should match some business goal for the model.

Since you are now trying to choose after the fact, you should go back to the reason you created the model in the first place and see if you can identify the correct metric. It might not be either accuracy or loss.

Neil Slater
  • Thank you! It's a nice idea to use k-fold CV. I use my model for an industrial case; it's a text classification problem. Some people said to choose the smallest loss because high accuracy can lead to overfitting in the future; is that always true? – malioboro Feb 18 '19 at 14:45
  • @malioboro: I don't think that is true. Without any theory to back it up, it is an empty statement in any case. I could equally say "choose the highest accuracy, because low loss can lead to overfitting in future". What do your "some people" say about that - i.e. what is their theory, and under what assumptions does it work? Either way, ignore them and choose the correct metric for your problem as suggested by its intended use. – Neil Slater Feb 18 '19 at 14:54
  • @NeilSlater Ah, I see, so there isn't any theory that says a lower loss is safe from overfitting. Could you please give a few examples of problems and their correct metrics in your answer? I'm not quite sure about the metric for my problem – malioboro Feb 20 '19 at 08:53
  • Where does "You should note that both your results are consistent with a "true" probability of 87% accuracy, and your measurement of a difference between these models is not statistically significant." come from? Can you explain this sentence a little better in relation to the results presented by the OP? – nbro Feb 23 '19 at 17:55
  • @nbro: It comes from a very brief experiment assuming that the accuracy is 87%. This is probably not strictly valid mathematically; I simply ran a few thousand trials with p = 0.87 and 100 tests, and counted the number of times it saw 91+ or 83-. That's more than a bit hacky, so I don't really want to go into the details. The take-away is that the error bars on the "91% accuracy" measurement using only 100 test samples are large. – Neil Slater Feb 23 '19 at 18:07
1

It depends on your application! Imagine a binary classifier that is always very "confident" - it always assigns P=100% to Class A and 0% to Class B, or vice versa (sometimes wrong, never uncertain!). Now imagine a "humble" model that is perhaps fractionally less accurate, but whose probabilities are actually meaningful (when it says "Class A with probability 70%" it is wrong 30% of the time).
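As a rough numerical sketch of that contrast (illustrative figures, not taken from the question), a model that is slightly more accurate but always claims near-certainty can end up with a much larger log loss than a slightly less accurate, well-calibrated one:

```python
# Toy sketch (illustrative numbers, not from the question): expected log loss
# for a classifier that always claims near-certainty versus one whose reported
# confidence matches how often it is actually right.
import math

def expected_log_loss(accuracy, claimed_confidence):
    """Average cross-entropy when the model is correct with probability
    `accuracy` but always reports `claimed_confidence` for its chosen class."""
    return -(accuracy * math.log(claimed_confidence)
             + (1 - accuracy) * math.log(1 - claimed_confidence))

print("confident: 91% accurate, always claims 99.9%:",
      round(expected_log_loss(0.91, 0.999), 3))
print("humble:    89% accurate, claims a calibrated 89%:",
      round(expected_log_loss(0.89, 0.89), 3))
```

The overconfident model pays heavily for the few cases it gets wrong, which is one way the accuracy/loss tension in the question can arise.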

In your case, both losses are quite small, so we probably prefer the more accurate one.

Edward Dixon