
(I apologize for the title being too broad and the question not being very 'technical'.)

Suppose that my task is to label news articles. This means that, given a news article, I am supposed to classify which category the article belongs to. E.g., 'Ronaldo scores a fantastic goal' should be classified under 'Sports'.

After much experimentation, I came up with a model that does this labelling for me. It has, say, 50% validation accuracy. (Assume this is the best achievable.)

And so I deployed this model for my task (on unseen data, obviously). Of course, from a probabilistic perspective, I should get roughly 50% of the articles labelled correctly. But how do I know which labels are actually correct and which need to be corrected? If I were to check manually (say, by hiring people to do so), how is deploying such a model better than just hiring people to do the classification directly? (Not to mention that the manpower cost of developing the model could have been saved.)

  • Welcome to SE:AI! – DukeZhou Oct 24 '19 at 00:51
  • "I should get roughly 50% of the articles labelled correctly." You should? If there are n categories, each article fitting one category, and articles are evenly distributed across categories, random luck would give you a 1/n chance of getting it correct for one article. If it's just sports vs non-sports, (again evenly distributed), sure, 50% is no better than random luck, but that doesn't seem to be the case. – muru Oct 24 '19 at 02:04

2 Answers


There are several advantages:

  1. Some text classification systems are much more accurate than 50%. For example, most spam classification systems are 99.9% accurate, or more. There is little value in having employees review those labels.
  2. Many text classification systems can output a confidence score as well as a label. You can selectively have employees review only the examples the model is not confident about. Often these will be small in number (see the first sketch after this list).
  3. You can usually test a text classification model by having it classify some unseen data and then asking people to check its work. If you do this for a small number of examples, you can make sure the system is working. You can then confidently use the system on a much larger set of unlabelled examples and be reasonably sure about how accurate it is.
  4. For text, it is also important to measure how much different people agree on the labels, because this gives you a notion of how subjective your specific problem is; you are unlikely to do better than that level of agreement. If people disagree 50% of the time anyway, maybe you can accept a 50% failure rate from the automated system and not bother checking its work (see the second sketch below for one common agreement measure).
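To make point 2 concrete, here is a minimal sketch of routing low-confidence predictions to human review. It assumes scikit-learn; the tiny training set, the category names, and the 0.7 threshold are invented for illustration and are not part of the answer.

```python
# Sketch of point 2: accept confident predictions, flag the rest for human review.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set (a real system needs far more labelled data).
train_texts = [
    "Ronaldo scores a fantastic goal",
    "The central bank raises interest rates",
    "New smartphone released with a faster chip",
    "Team wins the championship after penalties",
]
train_labels = ["Sports", "Finance", "Tech", "Sports"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

REVIEW_THRESHOLD = 0.7  # arbitrary; tune on validation data

def label_or_flag(article: str):
    """Return (predicted label, needs_human_review) for one article."""
    probs = model.predict_proba([article])[0]
    best = probs.argmax()
    return model.classes_[best], probs[best] < REVIEW_THRESHOLD

label, needs_review = label_or_flag("Stock markets rally after earnings news")
print(label, "-> send to human reviewer" if needs_review else "-> accept automatically")
```

Only the flagged articles go to the hired reviewers, so the human workload shrinks as the model's confident predictions prove reliable.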
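Point 4 does not prescribe a particular agreement measure; Cohen's kappa is one common choice, available in scikit-learn. The annotator labels below are made up for illustration.

```python
# Sketch of point 4: quantify how often two human labellers agree using Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["Sports", "Tech", "Finance", "Sports", "Tech", "Finance"]
annotator_b = ["Sports", "Tech", "Sports", "Sports", "Finance", "Finance"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, ~0 = chance-level agreement

# If humans themselves only reach modest agreement on this task, a model whose
# disagreement with human labels is similar may already be near the practical ceiling.
```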
John Doucette
  • I don't think that the spam classification system attached to my mail client is that accurate. For example, just yesterday it labelled two e-mails as junk when they were not. – nbro Oct 23 '19 at 16:08
  • @nbro You are correct, I had misremembered the statistics. It's been at 99.9% or higher for quite a long time (more than a decade). Even though the classifiers keep improving, spam authors keep using more sophisticated techniques, so it has the nature of an arms race. – John Doucette Oct 23 '19 at 17:26
  • @nbro: You also need to account for the fact that the majority of the email you receive is likely to be spam. The spam that you correctly don't see counts for most of that 99.9% accuracy rating. – Neil Slater Oct 23 '19 at 19:32
  • @NeilSlater If the majority of the e-mail I receive is spam, then I do not expect two e-mails to be marked as spam on the same day when they are not, if the predictive system really understands something about the difference between spam and non-spam. In other words, if 99% of the e-mails are spam, then you can just always predict spam and you will get a nice accuracy. Clearly, that accuracy is highly misleading. – nbro Oct 23 '19 at 19:57
  • @nbro FWIW, most spam classification literature uses AUC as the measurement. They typically report 99.99% AUC. This means that they can obtain extremely high accuracy on _both_ classes, despite the class imbalance. – John Doucette Oct 23 '19 at 23:48
  • I believe that point 2 answers my question the best. I have also come across a method that uses the predictions of multiple models: if those models all produce the same prediction, then it is likely that the prediction can be treated as 'correct', which saves cost. – Air Christmas Oct 24 '19 at 12:20

First of all, to be realistic, you would usually expect more than 50% validation accuracy when classifying articles.

Back to your question: you should definitely try to automate this process if you are looking for a long-term solution for labelling articles. Deploying such a model should not cost more than hiring employees to do this manually, at least from a long-term perspective.

theonekeyg