3

I have order data, here's a sample:

Ninety-six (96) covered pans, desinated mark cutlery.
5 vovered pans by knife co.
(SEE SCHEDULE A FOR NUMBERS). 757 SOUP PANS
115 10-quart capacity pots.
Thirteen (13), 30 mm thick covered pans. 

I have over 50k rows of data like this. In a perfect world, the above would be tabulated as follows:

count, type
96, covered pan
5, covered pan
757, soup pan
115, pot
13, covered pan

Could machine learning be the correct approach for a problem such as this?

lewicki

2 Answers

1

Yes, a variant of NLP could help find the correct number and type of object to extract from this data.

Compared to the spreadsheet, the raw text is ambiguous: extracting the relevant information requires understanding the language to a reasonable depth and knowing the business context.

For instance, you expect to extract "soup pan" and "covered pan", but not "capacity pot". You also expect parts of phrases such as "30 mm" or "10-quart" to be treated as lower-importance qualifiers, not as the quantity of something.

The current state of the art for extracting this kind of data would be a bidirectional LSTM (a type of Recurrent Neural Network). You would likely get it to flag the parts of each entry that are relevant to the tabulated data you want to extract, then feed those into a simpler stage that puts them into the spreadsheet. However, there are two caveats:

  • You need a lot of correctly-labelled training data to get reasonable performance. Using a word embedding layer, such as word2vec or GloVe, should significantly reduce the amount of training data required, but may require a careful pre-processing stage, and may be less useful when you have a lot of jargon in your data.

  • Performance is never perfect, and the system can still make stupid mistakes, because it does not truly understand the text it is dealing with. That applies to all ML approaches to this problem, and likely also to coding up an "expert system", although it may be easier to write the expert system to recognise when it has failed and ask for help.
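
To make the two-stage idea concrete, here is a minimal sketch of only the "simpler stage". It assumes the tagger has already labelled each token; the label names COUNT, TYPE and O, the function name tags_to_row, and the hand-written tags below are all hypothetical stand-ins for real model output, not part of any library.

```python
def tags_to_row(tokens, tags):
    """Turn per-token labels into a (count, type) pair, or None if unusable.

    `tags` is assumed to come from a sequence tagger (e.g. a BiLSTM);
    here the labels are written by hand purely for illustration.
    """
    count_tokens = [tok for tok, tag in zip(tokens, tags) if tag == "COUNT"]
    type_tokens = [tok for tok, tag in zip(tokens, tags) if tag == "TYPE"]

    # Refuse to guess: exactly one numeric count and at least one type token.
    if len(count_tokens) != 1 or not count_tokens[0].isdigit() or not type_tokens:
        return None

    item = " ".join(type_tokens).lower()
    # Naive singularisation, good enough for "pans" and "pots" only.
    if item.endswith("s"):
        item = item[:-1]
    return int(count_tokens[0]), item


tokens_a = ["115", "10-quart", "capacity", "pots", "."]
tags_a = ["COUNT", "O", "O", "TYPE", "O"]

tokens_b = ["Thirteen", "(", "13", ")", ",", "30", "mm", "thick",
            "covered", "pans", "."]
tags_b = ["O", "O", "COUNT", "O", "O", "O", "O", "O", "TYPE", "TYPE", "O"]

row_a = tags_to_row(tokens_a, tags_a)
row_b = tags_to_row(tokens_b, tags_b)
print(row_a)
print(row_b)
```

Returning None when the labels do not add up to exactly one count and one type gives you a natural place to send an entry for manual review instead of writing a bad row to the spreadsheet.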

Neil Slater
-1

Maybe a simple regex could solve this problem, but you will probably need to supervise the results, approving or rejecting anomalies.

See this example, I had the same problem some time ago: https://stackoverflow.com/questions/50689935/regex-like-commands-python
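
As a rough sketch of what that could look like on the sample rows from the question: the list of item types and the list of measurement units below are assumptions for illustration, and a real data set would need a longer vocabulary. Rows the patterns cannot handle come back as None so they can be reviewed by hand.

```python
import re

# Assumed vocabulary of item types (hypothetical; extend for real data).
ITEM_TYPES = ["covered pan", "soup pan", "pot"]
TYPE_RE = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in ITEM_TYPES) + r")s?\b",
    re.IGNORECASE,
)

# First integer that is not part of a measurement such as "30 mm" or "10-quart".
# The unit list is an assumption, not an exhaustive one.
COUNT_RE = re.compile(r"\b(\d+)\b(?![\s-]*(?:mm|cm|quart|inch)\b)", re.IGNORECASE)


def parse_row(text):
    """Return (count, type), or None when the row needs manual review."""
    count = COUNT_RE.search(text)
    item = TYPE_RE.search(text)
    if count is None or item is None:
        return None
    return int(count.group(1)), item.group(1).lower()


SAMPLE_ROWS = [
    "Ninety-six (96) covered pans, desinated mark cutlery.",
    "5 vovered pans by knife co.",
    "(SEE SCHEDULE A FOR NUMBERS). 757 SOUP PANS",
    "115 10-quart capacity pots.",
    "Thirteen (13), 30 mm thick covered pans.",
]

for row in SAMPLE_ROWS:
    parsed = parse_row(row)
    print(parsed if parsed else f"REVIEW: {row!r}")
```

With these patterns, four of the five sample rows parse, and the "vovered pans" row is flagged for review because the typo is not in the vocabulary. That is the kind of anomaly you would approve or reject by hand.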

GIA