
I am interested in running a simple but hopefully rigorous algorithm which “learns” how to parse HTML - ideally optimally fast.

I have never done this before and I’m wondering if there’s a go-to algorithm or library for this.

I would prefer to start with an extremely simple algorithm, even if it is not fast or efficient, as long as it is effective. Something like: I give it a corpus of perfect HTML with no errors, and it should generate every possible parsing “hypothesis” and score each one.

Maybe a simple way to represent a successful “parse” (for a first version of the project) is just the insertion of spaces between HTML language elements.
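To make that concrete, here is a brute-force sketch of the space-insertion idea in Python. The token regexes and the scoring rule are my own toy assumptions, not a standard algorithm: every possible way of inserting boundaries is enumerated, and hypotheses where every part is a recognizable token win, with coarser splits preferred.

```python
import re
from itertools import product

TAG = re.compile(r"^</?[a-z]+>$")    # a complete open or close tag
TEXT = re.compile(r"^[^<>]+$")       # bare text with no angle brackets

def segmentations(s):
    """Yield every hypothesis: at each gap between characters,
    either insert a boundary (a "space") or don't."""
    for gaps in product([False, True], repeat=len(s) - 1):
        parts, cur = [], s[0]
        for ch, gap in zip(s[1:], gaps):
            if gap:
                parts.append(cur)
                cur = ch
            else:
                cur += ch
        parts.append(cur)
        yield parts

def score(parts):
    """Toy score: first prefer hypotheses where every part is a
    recognizable token, then prefer the coarsest split."""
    ok = all(TAG.match(p) or TEXT.match(p) for p in parts)
    return (ok, -len(parts))

best = max(segmentations("<html>hi</html>"), key=score)
# best == ['<html>', 'hi', '</html>']
```

Note this enumerates 2^(n-1) hypotheses for a string of length n, so it only works on very short inputs, which matches the "simple but not fast" first version you describe.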

So, let’s say the model may receive HTML like

< html > text </ html >

and

<html>text</html> (or something like that, anyway),

Through trial and error it should generate every possible arrangement of characters in ASCII/UTF-8 and build a tree of states depending on what it gets in some given input string. If it gets “a”, maybe it treats any other letter as having identical decision-relevance, but treats “<” as a different branch on the decision tree (or a cyclical decision graph)?
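That branching idea could be sketched like this. The `char_class` bucketing is a made-up assumption, purely to illustrate all letters sharing one branch while “<” gets its own:

```python
def char_class(ch):
    """Bucket characters into decision-relevant classes: letters and
    digits are interchangeable, structural characters get own branches."""
    if ch in "<>/":
        return ch
    if ch.isalnum():
        return "ALNUM"
    return "OTHER"

def build_states(corpus):
    """Build a tree of states keyed by character class; each path
    through the nested dicts is a class sequence seen in the corpus."""
    root = {}
    for s in corpus:
        node = root
        for ch in s:
            node = node.setdefault(char_class(ch), {})
    return root

states = build_states(["<html>text</html>", "<b>hi</b>"])
# Both strings start with "<", so the root has a single "<" branch,
# and all letters collapse into the shared "ALNUM" branch below it.
```

Merging states that loop back on themselves (e.g. ALNUM repeating) would turn this tree into the cyclical decision graph you mention.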

I’m completely new to this and excited to finally be trying out simple, deterministic, optimizing grammar induction, which I have been learning about for the last 6 months and finally would like to try.

I can read up on some resources here for this:

https://nlp.stanford.edu/projects/up-gi.shtml

Thanks.

hmltn
  • Just an FYI, this is what I have been looking for for months - https://en.wikipedia.org/wiki/Symbolic_regression - but I could still use help implementing a simple first run of it. – hmltn Jun 22 '23 at 10:04
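For what it's worth, a bare-bones first run of symbolic regression can be done by exhaustive enumeration rather than with a library. The grammar, depth limit, and target data below are all placeholder assumptions, just to show the generate-and-score loop:

```python
from itertools import product

X = list(range(5))
Y = [x * x + 1 for x in X]          # hidden target: x^2 + 1 (made up)

LEAVES = ["x", "1"]
OPS = ["+", "*"]

def exprs(depth):
    """Enumerate every expression string up to the given tree depth."""
    if depth == 0:
        yield from LEAVES
        return
    yield from exprs(depth - 1)
    for op, a, b in product(OPS, exprs(depth - 1), exprs(depth - 1)):
        yield f"({a} {op} {b})"

def sse(e):
    """Sum of squared errors of a candidate expression on the data."""
    return sum((eval(e, {"x": x}) - y) ** 2 for x, y in zip(X, Y))

# Score every hypothesis and keep the best one, exactly as in the
# question's "generate every hypothesis and score it" framing.
best = min(exprs(2), key=sse)
```

At depth 2 the enumeration contains an expression equivalent to `1 + x*x`, so the search recovers the target exactly; real symbolic-regression systems replace the exhaustive loop with genetic programming over the same kind of expression trees.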

0 Answers