
I'm currently trying to build a semantic scraper that can extract product information from the websites of different suppliers in the packaging industry, with as little manual customization per supplier/website as possible.

The current approach that I'm thinking of is the following:

  1. Get all the text data via scrapy (so basically an HTML-tag search). This data would hopefully already be semi-structured, with for example: name, description, product image, etc.
  2. Fine-tune a pre-trained NLP model (such as BERT) on a domain-specific dataset for packaging to extract more information about the product, for example the weight and size of the product.
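To make step 1 concrete, here is a minimal sketch of the per-page extraction logic using only the standard library. In a real spider, Scrapy's `response.css()`/`response.xpath()` selectors would replace the hand-rolled parser; the class names (`product-name`, `product-description`, `product-image`) and the sample markup are assumptions, not a real supplier's HTML.

```python
from html.parser import HTMLParser

# Sketch: pull semi-structured product fields (name, description, image)
# out of a product page. Class names below are hypothetical.
class ProductParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.product = {}
        self._field = None  # which field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class", "")
        if tag == "h1" and "product-name" in cls:
            self._field = "name"
        elif tag == "p" and "product-description" in cls:
            self._field = "description"
        elif tag == "img" and "product-image" in cls:
            self.product["image"] = attrs.get("src")

    def handle_data(self, data):
        if self._field:
            self.product[self._field] = data.strip()
            self._field = None

sample = """
<h1 class="product-name">Corrugated Box 40x30x20</h1>
<p class="product-description">Double-wall corrugated shipping box.</p>
<img class="product-image" src="/img/box.jpg">
"""

parser = ProductParser()
parser.feed(sample)
print(parser.product)
```

The same idea transfers directly to a Scrapy `parse()` callback, where each `if` branch becomes a CSS or XPath selector.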
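Before investing in fine-tuning for step 2, a cheap rule-based baseline for the most regular attributes (weight, dimensions) is worth having, both as a fallback and to bootstrap training labels. This is a sketch under the assumption that specs appear in fairly conventional formats; the patterns are illustrative, not exhaustive.

```python
import re

def extract_specs(text):
    """Rule-based baseline for weight and size in free-form product text.
    A fine-tuned NER model would generalize better to messy phrasing."""
    specs = {}
    m = re.search(r"(\d+(?:\.\d+)?)\s*(kg|g|lb)\b", text, re.I)
    if m:
        specs["weight"] = f"{m.group(1)} {m.group(2).lower()}"
    m = re.search(r"(\d+)\s*[x\u00d7]\s*(\d+)\s*[x\u00d7]\s*(\d+)\s*(mm|cm|m)\b", text, re.I)
    if m:
        specs["size"] = f"{m.group(1)}x{m.group(2)}x{m.group(3)} {m.group(4).lower()}"
    return specs

print(extract_specs("Box, 40 x 30 x 20 cm, 0.5 kg"))
```

Entities the regexes miss (or mislabel) are exactly the cases where a token-classification model fine-tuned on packaging data should pay off.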

What do you think about the approach? What would you do differently?

One challenge I already encountered is the following:

  • Not all supplier websites are as structured as, for example, e-commerce sites are → so small customisations of the XPaths are needed per website. How can you scale this?
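One common way to keep per-site tweaks manageable is to move all site-specific knowledge out of spider code and into a declarative config, so each new supplier costs one config entry rather than a new spider. A minimal sketch (the hostnames and selectors are made up):

```python
from urllib.parse import urlparse

# Per-site XPath overrides; anything not listed here falls back to
# generic heuristics (step 1). Hostnames and selectors are hypothetical.
SITE_RULES = {
    "supplier-a.example": {
        "name": "//h1[@class='title']/text()",
        "description": "//div[@id='desc']//text()",
    },
    "supplier-b.example": {
        "name": "//h2[@itemprop='name']/text()",
        "description": "//p[@class='info']/text()",
    },
}

def rules_for(url):
    """Look up the selector set for a URL's host; empty dict means
    'no overrides, use the generic extraction path'."""
    host = urlparse(url).netloc
    return SITE_RULES.get(host, {})

print(rules_for("https://supplier-a.example/products/box-1"))
```

A single generic spider then consults `rules_for()` at parse time, which keeps the scaling cost linear in config lines rather than in code.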

Also, does anyone know of an open-source project that would be a good starting point for this?

johannesha