I am working on a project where I have a dataset consisting of unstructured data from multiple ERP systems. Each dataset (extracted from an ERP) has different columns, and unfortunately, there is no standard format for the data. Among the columns, there is a product code, along with other product-related information. The product code can be in various columns, or even within a larger column description.
My goal is to extract the product code from each row of this unstructured data. I am looking for advice on which type of model or strategy I can apply to extract the product codes automatically.
Here are some key points for consideration:
- The data is unstructured and comes from various ERP systems;
- There is no standard format for the columns or the product code placement;
- The product code can be within a larger column description;
- The extracted data isn't natural language (it has no syntax), so I don't have actual sequences. The columns are extracted from an ERP system and basically contain a bunch of keywords, like '3/4" SHOE RED BLUE NIKE';
- There are millions of possible product codes.
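To make the task concrete, a naive rule-based baseline might look like the sketch below. It assumes, purely for illustration, that a product code is a token of at least five characters mixing letters and digits; the real codes may look nothing like this. The pattern `TOKEN_RE`, the function name `candidate_codes`, and the sample rows (including the codes in them) are all made up.

```python
import re

# Hypothetical pattern: a token of 5+ characters made of letters, digits, hyphens.
# Real product codes may follow a different shape; adjust to the actual data.
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9\-]{4,}")

def candidate_codes(row):
    """Return tokens from any column of `row` that look like a product code."""
    candidates = []
    for value in row.values():
        for token in TOKEN_RE.findall(str(value)):
            has_digit = any(ch.isdigit() for ch in token)
            has_alpha = any(ch.isalpha() for ch in token)
            # Assumption: a code contains both letters and digits.
            if has_digit and has_alpha:
                candidates.append(token)
    return candidates

# Made-up rows standing in for two ERP exports with different columns.
rows = [
    {"desc": '3/4" SHOE RED BLUE NIKE AB-12345', "qty": 2},
    {"item": "XZ9981", "notes": "SHOE BLUE"},
]
for row in rows:
    print(candidate_codes(row))
```

A rule like this scans every column, so it does not care where the code sits, but it fails as soon as codes are purely numeric or resemble sizes and quantities, which is why a more robust model or strategy is needed.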
Any suggestions or recommendations on models, strategies, or tools that would help me achieve this goal would be greatly appreciated. If you have any experience with a similar problem, please share your insights or any relevant resources.
Thank you in advance for your help!