
I am working on a project where I have a dataset consisting of unstructured data from multiple ERP systems. Each dataset (extracted from an ERP) has different columns, and unfortunately, there is no standard format for the data. Among the columns, there is a product code, along with other product-related information. The product code can be in various columns, or even within a larger column description.

My goal is to extract the product code from each row of this unstructured data. I am looking for advice on which type of model or strategy I can apply to extract the product codes automatically.

Here are some key points for consideration:

  • The data is unstructured and comes from various ERP systems;
  • There is no standard format for the columns or the product code placement;
  • The product code can be within a larger column description;
  • The extracted data isn't natural language (it has no syntax), so I don't have actual sequences. The columns are extracted from an ERP system and basically contain a bunch of keywords, such as '3/4" SHOE RED BLUE NIKE';
  • There are millions of possible product codes.

Any suggestions or recommendations on models, strategies, or tools that would help me achieve this goal would be greatly appreciated. If you have any experience with a similar problem, please share your insights or any relevant resources.

Thank you in advance for your help!

delucca
    It would be very useful to see an example of the data. Like an image? Also, please explain what an ERP is, as not everyone will know, and how that information matters for understanding the case. – Robin van Hoorn Apr 06 '23 at 08:53

1 Answer


There are two classes of methods that could potentially solve this: hardcoded rules or learning. As with any other problem, start with the first one (rules), and move to the second one (learning) only if rules aren't enough.

Here's how I would proceed:

  • Try to think about what makes something a product code or not. For instance, if you take a column and split it into strings separated by spaces or column separators, you could look at things such as
    • minimum length,
    • maximum length,
    • combinations of letters, numbers and other characters that are required, allowed or forbidden.

Even if these criteria aren't enough to get exactly the codes you want, they will help you in the next step. The important thing is that, as far as possible, the rules shouldn't miss any valid code: it is fine if they let through some strings that aren't codes. Refine the rules as much as you can, and if they still aren't enough you can start learning.
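As a rough illustration of this first step, here is a minimal sketch of permissive candidate extraction. The length limits, the allowed character set, and the function names are all assumptions made up for the example, not rules taken from your data; you would tune them against real exports.

```python
import re

# Hypothetical rule parameters; tune them against real ERP exports.
MIN_LEN = 3
MAX_LEN = 20
# Assumed allowed characters: letters, digits and a few separators.
ALLOWED = re.compile(r"[A-Za-z0-9/\-._]+")
# Assumed token separators: whitespace and common column delimiters.
SPLIT = re.compile(r"[\s;,|]+")


def candidate_codes(cell: str) -> list[str]:
    """Return the tokens of one cell that pass the permissive rules."""
    candidates = []
    for token in SPLIT.split(cell.strip()):
        if not (MIN_LEN <= len(token) <= MAX_LEN):
            continue  # too short or too long to be a code
        if not ALLOWED.fullmatch(token):
            continue  # contains a forbidden character, e.g. a quote
        candidates.append(token)
    return candidates


def candidates_for_row(row: dict[str, str]) -> list[tuple[str, str]]:
    """Scan every column, since the code may sit in any of them."""
    return [
        (column, token)
        for column, value in row.items()
        for token in candidate_codes(str(value))
    ]


row = {"desc": '3/4" SHOE RED BLUE NIKE 13123KDD-D/3', "qty": "12"}
print(candidates_for_row(row))
```

The rules are deliberately loose: plain words such as SHOE or NIKE still come through, because at this stage missing a real code is worse than keeping a false candidate.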

  • With the first method you have a way of generating a dataset of potential codes. Now you should look at the dataset and find those datapoints that aren't codes. If you can create a rule to exclude them, do it. If not, annotate them by hand (that is, add a "wrong" flag to those). Then you can train a model to classify the potential codes as correct or incorrect.
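To make the learning step more concrete, here is a small sketch of how hand-annotated candidates could be turned into a training set. The features below (length, digit ratio, and so on) are only my guesses at what might separate codes from ordinary keywords, and the three labelled examples are invented for illustration.

```python
import re


def token_features(token: str) -> dict[str, float]:
    """Describe the 'shape' of a candidate token with a few numbers."""
    n = len(token)
    digits = sum(ch.isdigit() for ch in token)
    letters = sum(ch.isalpha() for ch in token)
    return {
        "length": float(n),
        "digit_ratio": digits / n if n else 0.0,
        "letter_ratio": letters / n if n else 0.0,
        "has_separator": float(bool(re.search(r"[/\-._]", token))),
        "mixes_letters_and_digits": float(digits > 0 and letters > 0),
    }


def build_dataset(labelled):
    """Turn (token, is_code) pairs into parallel feature and label lists."""
    features = [token_features(token) for token, _ in labelled]
    labels = [int(is_code) for _, is_code in labelled]
    return features, labels


# Invented annotations: True means "this candidate is a real code".
labelled = [("13123KDD-D/3", True), ("NIKE", False), ("BLUE", False)]
X, y = build_dataset(labelled)
print(X[0], y)
```

Feature dictionaries like these can be fed to most off-the-shelf classifiers; with scikit-learn, for example, `DictVectorizer` followed by `LogisticRegression` would be one reasonable starting point, though I haven't tested that on data like yours.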

Let me know if this didn't answer your question and if you have questions :)

  • Thanks for your detailed response! I have a couple of questions: - Unfortunately, we can't specify what makes a "product code". The brand of the project has its own rules, and sometimes the brand doesn't follow those rules either. So, one product can have a product code of "ABC" and another one "13123KDD-D/3". The only thing that is certain is that the product code is usually a random string combination (with some arbitrary logic) (continued in the next message) – delucca Apr 12 '23 at 13:50
  • - One thing that I know from exploring the data is that the product code is usually composed of 2 parts. The first string is usually the "real" code (which is shared across multiple products), while the rest of the code is usually a "description" (think of a "product type" + "product specification") – delucca Apr 12 '23 at 13:51
  • Based on the above, I was able to build a NER model using spaCy that identifies both parts (the "product code" and the "product description"). I got ~80% accuracy, but it misses the "description" most of the time, while it gets the code itself right most of the time. I tried using spaCy's spancat, but I was not able to get it working. Do you have any suggestions on how to improve that? Especially the part about getting the actual description. – delucca Apr 12 '23 at 13:52