I'm designing an NLP model to extract various kinds of "hidden" expenses from 10-K and 10-Q financial statements. I've come up with about 7 different expense categories (restructuring costs, mergers and acquisitions, etc.), and for each one I have a list of terms/synonyms that different companies use for them. I'm new to NLP and would like some advice on the best approach for extracting them.
Values are usually hidden in two different areas of the document:
Type 1: Free-form text (footnotes)
Values are nested in sentences. Here are some examples, with the Expense Type and Monetary value indicated.
Exploratory dry-hole costs were $12.7 million, $1.3 million, and $1.0 million for the years ended December 31, 2012, 2011, and 2010, respectively.
2012 includes the recognition of a $3,340 million impairment charge related to the carrying value of Citi's remaining 35% interest in the Morgan Stanley Smith Barney joint venture
During the year ended December 31, 2017, we decided to discontinue the internal development of AMG 899, resulting in an impairment charge of $400 million for the IPR&D asset
Type 2: Table data
SEC statements also contain "structured" data in HTML tables. Some line items, like the first row below, correspond to the expense type I'm looking for:
Item | 2020 | 2019 | 2018 |
---|---|---|---|
impairment related to real estate assets(2): | 398.2 | 200 | 0 |
research and development | 100 | 200 | 300 |
other expenses | 20 | 30 | 40 |
Correct value = 398.2
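A minimal sketch of the lookup for the table above, assuming the rows have already been pulled out of the HTML into plain lists of strings. The synonym list `IMPAIRMENT_TERMS` and the helpers `normalize_label` and `find_expense` are hypothetical names used only for illustration, and a plain substring test stands in for whatever matching is eventually used.

```python
import re

# Hypothetical synonym list for one expense category.
IMPAIRMENT_TERMS = ["impairment", "write-down", "write-off"]


def normalize_label(label):
    """Lower-case a row label and strip footnote markers like "(2)" and a trailing colon."""
    label = re.sub(r"\(\d+\)", "", label.lower())
    return label.strip().rstrip(":").strip()


def find_expense(rows, header, terms, year):
    """Return (label, value) for the first row whose label contains a known term, else None."""
    col = header.index(year)
    for row in rows:
        label = normalize_label(row[0])
        if any(term in label for term in terms):
            return label, float(row[col])
    return None


# The example table, already split into cells (assumed to be done upstream).
header = ["Item", "2020", "2019", "2018"]
rows = [
    ["impairment related to real estate assets(2):", "398.2", "200", "0"],
    ["research and development", "100", "200", "300"],
    ["other expenses", "20", "30", "40"],
]

result = find_expense(rows, header, IMPAIRMENT_TERMS, "2020")
print(result)
```

Normalizing the label first matters because footnote markers and trailing colons vary between filings and would otherwise break exact comparisons.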
I'm thinking about a two-model approach:
1. Define a new NER model based on the terms I already know (e.g. "dry-hole costs", "impairment charges"). I would need to manually annotate extracts from historical statements that contain these terms for the training set.
   - For free-form text, it would match the sentence and pass it on for further processing (see 2).
   - For table data, I would loop over each row using BeautifulSoup and pandas, check the first column for a match (e.g. using spaCy's similarity function), and then grab that year's value from the DataFrame and finish.
2. For free-form matches, I still need to grab the monetary value for the correct year (sometimes multiple values are given for various years; see the first example above).
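To make the "correct year" step concrete, here is a rough sketch that pairs amounts with years by position, which is how "respectively" sentences like the first example are laid out. It is a regex heuristic rather than a trained model; the function name `amounts_by_year` and both patterns are assumptions for illustration, and the function deliberately returns nothing when the counts of amounts and years disagree.

```python
import re

# Dollar amount with an optional scale word, e.g. "$12.7 million" or "$3,340 million".
AMOUNT_RE = re.compile(
    r"\$\s?([\d,]+(?:\.\d+)?)\s*(thousand|million|billion)?", re.IGNORECASE
)
# Four-digit years from 1900 to 2099.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def amounts_by_year(sentence):
    """Pair the i-th amount with the i-th year; return {} if the pairing is ambiguous."""
    amounts = [
        (float(number.replace(",", "")), unit.lower())
        for number, unit in AMOUNT_RE.findall(sentence)
    ]
    # Blank out the amounts first so a figure like "$2,000" is never read as a year.
    remainder = AMOUNT_RE.sub(" ", sentence)
    years = []
    for year in YEAR_RE.findall(remainder):
        if int(year) not in years:
            years.append(int(year))
    if not amounts or len(amounts) != len(years):
        return {}
    return dict(zip(years, amounts))


sentence = (
    "Exploratory dry-hole costs were $12.7 million, $1.3 million, and "
    "$1.0 million for the years ended December 31, 2012, 2011, and 2010, "
    "respectively."
)
paired = amounts_by_year(sentence)
print(paired[2012])
```

Amounts are kept as a (number, scale word) pair rather than multiplied out, so the value stays exactly as reported in the filing.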
One potential problem is that sentences like this would be mishandled:
We gained $100 million this year, despite facing restructuring charges.
If the NLP algo is split into the above two-model process, model 1 would pass the sentence on (because it contains a known term like "restructuring charges"), and model 2 would extract $100 million, which is incorrect because that amount doesn't actually correspond to the expense itself.
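One cheap way to illustrate the failure mode is to accept an amount only when it is grammatically attached to the term, instead of merely appearing in the same sentence. The sketch below is a hand-written regex heuristic, not NER or dependency parsing; the function name `expense_amounts` and the two linking patterns are assumptions chosen to fit the example sentences, so real filings would need many more patterns or a parser-based check.

```python
import re

# Dollar amount with an optional scale word, e.g. "$400 million".
AMOUNT = r"\$[\d,]+(?:\.\d+)?(?:\s(?:thousand|million|billion))?"


def expense_amounts(sentence, term):
    """Return amounts directly attached to `term`, ignoring amounts elsewhere in the sentence."""
    t = re.escape(term)
    patterns = [
        # Amount immediately before the term: "$3,340 million impairment charge"
        rf"(?P<amount>{AMOUNT})\s+{t}",
        # Term, then a linking word, then the amount: "impairment charge of $400 million"
        rf"{t}s?\s+(?:of|was|were|totaled)\s+(?P<amount>{AMOUNT})",
    ]
    found = []
    for pattern in patterns:
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            found.append(match.group("amount"))
    return found


bad = "We gained $100 million this year, despite facing restructuring charges."
good = (
    "we decided to discontinue the internal development of AMG 899, resulting "
    "in an impairment charge of $400 million for the IPR&D asset"
)
print(expense_amounts(bad, "restructuring charge"))
print(expense_amounts(good, "impairment charge"))
```

The first call finds nothing because "$100 million" is not attached to "restructuring charges", while the second picks up "$400 million". Note that for a list of amounts this only captures the first one, so it would need to be combined with a separate year-pairing step.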
Is there a better solution here? As I said, I'm new to NLP and data extraction so would really appreciate any advice or resources to learn more about solving these types of key/value problems.