I am interested in implementing a simple but hopefully rigorous algorithm that “learns” how to parse HTML, ideally becoming fast eventually.
I have never done this before and I’m wondering if there’s a go-to algorithm or library for this.
I would prefer to start with an extremely simple algorithm, even if it is slow and inefficient, as long as it is effective. Something like: I give it a corpus of perfect, error-free HTML, and it generates every possible parsing “hypothesis” and scores each one.
Maybe a simple way to represent a successful “parse” (for a first version of the project) is just the insertion of spaces between HTML language elements.
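For instance, here is a minimal brute-force Python sketch of that first version. The scoring rule is just a placeholder I made up that rewards hypotheses isolating the markup characters; a real version would presumably learn the score from the corpus:

```python
from itertools import product

def hypotheses(s):
    """Yield every way to insert (or not insert) a space in each gap
    between adjacent characters of s: 2**(len(s)-1) hypotheses."""
    gaps = len(s) - 1
    for mask in product([False, True], repeat=gaps):
        out = [s[0]]
        for ch, space in zip(s[1:], mask):
            if space:
                out.append(" ")
            out.append(ch)
        yield "".join(out)

def score(h):
    # Placeholder scoring (my own invention): +1 for each token that
    # is exactly a markup character, -1 for tokens that mix markup
    # characters with other text.
    pts = 0
    for tok in h.split(" "):
        if tok in ("<", ">", "/"):
            pts += 1
        elif len(tok) > 1 and any(c in tok for c in "<>/"):
            pts -= 1
    return pts

best = max(hypotheses("<b>hi</b>"), key=score)
```

With this toy score, the winning hypotheses are the ones that put spaces around every `<`, `>`, and `/`.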
So, let’s say the model may receive HTML like
< html > text </ html >
and
<html>text</html>
(or something like that, anyway),
through trial and error it would generate every possible arrangement of spaces between characters (ASCII/UTF-8), and build a tree of states depending on what it sees in a given input string. If it gets an “a”, maybe the result treats any other letter as having identical decision-relevance, but treats a “<” as a different branch of the decision tree (or a cyclic decision graph)?
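To make the decision-tree idea concrete, here is a sketch of collapsing characters into equivalence classes and building a trie of states over them (the class names and the choice of which characters get their own branch are my own assumptions):

```python
def char_class(c):
    # Collapse characters into equivalence classes so that, e.g.,
    # "a" and "b" take the same branch but "<" gets its own.
    if c.isalnum():
        return "CHAR"
    if c in "<>/":
        return c  # each markup character is its own branch
    return "OTHER"

def build_trie(corpus):
    # A trie keyed on character classes: inputs whose class sequences
    # share a prefix share states, so "<a>" and "<b>" follow one path.
    root = {}
    for s in corpus:
        node = root
        for c in s:
            node = node.setdefault(char_class(c), {})
    return root

trie = build_trie(["<html>x</html>", "<b>y</b>"])
```

Here both documents begin with the class sequence `<`, `CHAR`, so they share the first two states before diverging. Merging states that lead to identical subtrees would turn this trie into the cyclic decision graph mentioned above.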
I’m completely new to this and excited to be trying out simple, deterministic, optimizing grammar induction, which I have been reading about for the last six months and would finally like to put into practice.
I can read up on some resources here for this:
https://nlp.stanford.edu/projects/up-gi.shtml
Thanks.