I'm looking to build an AI that can extract in-text references from standards documents to assist human researchers.
My use case is extracting the identifying number of each referenced document, for example "AR 25-2", along with its title, "Information Assurance", so that a human can gather all the related research on a contract at once instead of having to keep track of references while reading through the document.
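To make the target output concrete, here is a rough rule-based sketch of the kind of extraction I mean. It is only an illustration, not the model I'm asking about: the pattern (2 to 5 capital letters, a space, then a hyphenated number, optionally followed by a quoted or parenthesized title) is my own assumption about how these references look, and the names `Reference`, `REF_PATTERN`, and `extract_references` are made up for this example.

```python
import re
from typing import List, NamedTuple, Optional


class Reference(NamedTuple):
    identifier: str          # e.g. "AR 25-2"
    title: Optional[str]     # e.g. "Information Assurance", or None if absent


# Assumed format: 2-5 uppercase letters, a space, then digits with optional
# hyphenated parts. A title may follow in double quotes or parentheses.
REF_PATTERN = re.compile(
    r'\b(?P<id>[A-Z]{2,5} \d+(?:-\d+)*)'
    r'(?:,? (?:"(?P<quoted>[^"]+)"|\((?P<paren>[^)]+)\)))?'
)


def extract_references(text: str) -> List[Reference]:
    """Return every identifier found in text, with its title when one follows."""
    refs = []
    for match in REF_PATTERN.finditer(text):
        title = match.group("quoted") or match.group("paren")
        refs.append(Reference(match.group("id"), title))
    return refs


sample = 'Systems must comply with AR 25-2, "Information Assurance", and AR 380-5.'
for ref in extract_references(sample):
    print(ref.identifier, "|", ref.title)
```

A pattern like this would miss references whose title is not quoted or sits elsewhere in the sentence, which is why I'm asking about a learned model instead.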
I have a pretty good idea of where to gather the names of these documents for training: I'm planning on scraping a few repositories that cover different categories of these documents.
What kind of model should I use to get the best results?