How to remove boilerplate (or extract main content) from web pages?

Asked Oct 03 '22 at 08:11

Active Oct 03 '22 at 08:11

Viewed 85 times

Data: I have the raw HTML source of a set of web pages, together with the cleaned main content I want to extract from each one. The pages come from different websites with different layouts and markup structures, but they all belong thematically to the same domain.

Problem: I want to remove everything except the main content from the source code, including all boilerplate. So far I have looked at Web2Text and BoilerNet. Web2Text uses CNNs and BoilerNet uses bidirectional LSTMs, but I would like to use a transformer-based approach. How should I prepare the data, and how would I train a model that takes the source code as input and outputs the main text? Can anyone point me in the right direction for transformer-based boilerplate removal resources?
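To make the data-preparation part of the question concrete, here is a minimal sketch of one possible framing: split each page into text blocks and label a block as main content when its text occurs in the cleaned reference text. The resulting sequence of labeled blocks could then be fed to a sequence classifier. The names `BlockCollector` and `label_blocks` are hypothetical, and the substring matching is an assumption that will mislabel short blocks that happen to occur inside the main text.

```python
from html.parser import HTMLParser

# Tags whose text is never visible content, and tags that have no closing tag.
SKIP_TAGS = {"script", "style", "noscript"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BlockCollector(HTMLParser):
    """Collects (tag path, text) pairs for every visible text node."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open tags
        self.blocks = []  # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag to tolerate sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not SKIP_TAGS.intersection(self.stack):
            self.blocks.append(("/".join(self.stack), text))


def normalize(text):
    """Lowercase and collapse whitespace so matching is layout-insensitive."""
    return " ".join(text.lower().split())


def label_blocks(html, main_content):
    """Label each text block 1 if it occurs in the cleaned main content, else 0."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    target = normalize(main_content)
    return [
        {"path": path, "text": text, "label": int(normalize(text) in target)}
        for path, text in collector.blocks
    ]
```

Each page then becomes a list of `{"path", "text", "label"}` records in document order, which is roughly the block-level labeling that sequence models for this task need. A transformer could encode each block's text (and optionally its tag path) and predict the label per block.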

Thank you in advance!

nesquick

0 Answers