
I'm building a customer assistant chatbot in Python, and I am modelling the problem as a text classification task. I have roughly 700 sentences available, with an average length of 15 words, and the classes are unbalanced.

Given that I will have to do some oversampling, do you think this dataset is large enough?

nbro
Alfonso

1 Answer


It depends on the number of classes; we are getting good results with about 40 training examples per class.

A good way to get an idea about this is to run a test with an increasing amount of training data, evaluating the result as you go along. Obviously, with a small set (e.g. 3 sentences per class), it will be very poor, but the accuracy should quickly increase and then stabilise at a higher level. With larger amounts of data you will probably only see a small increase, or no change at all.

Collecting this data would not only give you confidence in your conclusion, it would also be a good supporting argument when you have to ask for more training data, or have to justify the poor performance of the classifier if you do find the data set is too small.

So, set up an automated 10-fold cross-validation, feed it an increasing amount of your available data, sit back, and graph the results.
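As a minimal sketch of that experiment, assuming scikit-learn is available: the snippet below computes a learning curve with 10-fold cross-validation over growing training sizes. The toy `sentences`/`labels` data and the bag-of-words + naive Bayes pipeline are placeholders; substitute your own ~700 sentences and whatever classifier you actually use.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import learning_curve

# Toy stand-in data; replace with your own sentences and intent labels.
sentences = ["please reset my password", "how do i change my email",
             "cancel my subscription", "i want a refund",
             "my order has not arrived", "track my package",
             "update my billing address", "close my account"] * 10
labels = (["account", "account", "billing", "billing",
           "shipping", "shipping", "billing", "account"] * 10)

# Simple bag-of-words + naive Bayes baseline.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Evaluate with 10-fold cross-validation at increasing training sizes.
train_sizes, train_scores, test_scores = learning_curve(
    model, sentences, labels,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=10)

# Print the points you would graph: mean held-out accuracy per size.
for n, scores in zip(train_sizes, test_scores):
    print(f"{n:4d} training sentences: mean accuracy {scores.mean():.2f}")
```

If the curve has flattened well before the full data size, more data is unlikely to help much; if it is still climbing at the right-hand edge, that is your argument for collecting more.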

Oliver Mason