Data: The raw source code of web pages, each paired with the final cleaned main content I want to extract from it. The pages come from different websites with different layouts and code structures, but they all belong thematically to the same domain.
Problem: I want to remove everything but the main content from the source code, including all boilerplate. I have looked at the work of Web2Text and BoilerNet so far. While Web2Text used CNNs and BoilerNet used bidirectional LSTMs, I want to use a transformer-based approach. Can anyone explain to me how to prepare the data or how to train a model that will receive the source code as input and output the text? Can anyone point me in the right direction for boilerplate removal resources based on transformers?
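To make the data-preparation part of the question concrete, here is a rough sketch of what I imagine, assuming the task is framed as labelling each text node of the page as content (1) or boilerplate (0), which is how I understand the BoilerNet setup. The names `TextNodeCollector` and `label_nodes` are my own, and the substring match against the cleaned text is only a crude placeholder for a proper alignment. It uses only the Python standard library.

```python
from html.parser import HTMLParser

# Tags whose text is never visible content, and tags that have no closing tag.
SKIP_TAGS = {"script", "style", "noscript", "template"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextNodeCollector(HTMLParser):
    """Collects (tag_path, text) pairs for every visible text node."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # collected (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag so unclosed tags do not pile up.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and not any(t in SKIP_TAGS for t in self.stack):
            self.nodes.append(("/".join(self.stack), text))


def label_nodes(html, clean_text):
    """Return (tag_path, text, label) triples; label is 1 if the node's text
    occurs in the cleaned main content, else 0 (crude substring heuristic)."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    target = " ".join(clean_text.split())
    return [(path, text, int(text in target)) for path, text in collector.nodes]


if __name__ == "__main__":
    page = ("<html><body><nav><a href='/'>Home</a></nav>"
            "<article><p>Main text here.</p></article>"
            "<footer>Copyright</footer></body></html>")
    for row in label_nodes(page, "Main text here."):
        print(row)
```

Each page would then become a sequence of (tag path, text, label) triples, and my idea is that a transformer encoder could be fine-tuned to classify each node given its text and some encoding of the tag path. Very short nodes (a single word or punctuation mark) will often match the cleaned text by accident, so the labelling step would need something more robust than this.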
Thank you in advance!