We receive PDF files that contain valuable data in a somewhat structured way. There are a lot of variations, but it roughly translates to a 3 column layout.
| | 2021 | 2020 |
|---|---|---|
| Label 1 | 50 | 197 |
| Label 2 | 100 | 100 |
| A long Label with multiple words | 90 | 120 |
We receive the data in JSON format with the positions of the text/values in a coordinate system, but we can structure it in another form if needed. For example:
[
{ x:250, y:70, text: "2021" }
,{ x:500, y:70, text: "2020" }
,{ x:10, y:100, text: "Label 1" }
,{ x:250, y:100, text: "50" }
,{ x:500, y:100, text: "197" }
,{ x:10, y:120, text: "Label 2" }
...
]
This example has perfect coordinates, which makes the problem look easier than it is. Real data unfortunately does not. For example, the X of the "2021" header could be 200 while its value sits at 270, and for "2020" they could be 490 and 510. The X depends on the width of the text: with centered text, longer text has a lower X. I could calculate the center of each word, but even that is a little skewed.
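To make the center idea concrete, here is a minimal sketch that assigns a value to the header whose horizontal center is closest. It assumes each item also carries a `width` field, which the sample above does not have; the widths below are made up for illustration, and only the X values come from the example in the previous paragraph. The helper names `center_x` and `nearest_header` are hypothetical.

```python
def center_x(item):
    # `width` is an assumed field; the sample data only has x, y and text.
    return item["x"] + item["width"] / 2


def nearest_header(item, headers):
    """Return the header whose horizontal center is closest to the item's center."""
    return min(headers, key=lambda h: abs(center_x(h) - center_x(item)))


# X values from the skewed example; widths are invented for illustration.
headers = [
    {"x": 200, "y": 70, "width": 100, "text": "2021"},
    {"x": 490, "y": 70, "width": 60, "text": "2020"},
]
value = {"x": 270, "y": 100, "width": 30, "text": "50"}

print(nearest_header(value, headers)["text"])
```

Because the comparison is relative (closest header wins), a small skew in the centers matters less than it would with fixed column boundaries, as long as the columns are further apart than the skew.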
The end result of our tool would be that we know the value for each label and year.
Label 1 value for 2021 = 50
Label 1 value for 2020 = 197
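As a sketch of the target structure, the snippet below groups items into rows by Y (within a tolerance) and then pairs the values with the header years by their left-to-right order rather than by exact X. It assumes the first row is the header, every row has a label followed by one value per year, and no cell is missing; `group_rows`, `build_table` and the `y_tolerance` value are hypothetical names and settings, not part of our current tool.

```python
def group_rows(items, y_tolerance=5):
    """Group items whose y lies within y_tolerance of the first item in the row."""
    rows = []
    for item in sorted(items, key=lambda i: (i["y"], i["x"])):
        if rows and abs(rows[-1][0]["y"] - item["y"]) <= y_tolerance:
            rows[-1].append(item)
        else:
            rows.append([item])
    # Within each row, order the cells from left to right.
    return [sorted(row, key=lambda i: i["x"]) for row in rows]


def build_table(items):
    """Return {label: {year: value}}, assuming the first row holds the years."""
    header, *body = group_rows(items)
    years = [cell["text"] for cell in header]
    table = {}
    for row in body:
        label, *values = row
        table[label["text"]] = {
            year: cell["text"] for year, cell in zip(years, values)
        }
    return table


items = [
    {"x": 250, "y": 70, "text": "2021"},
    {"x": 500, "y": 70, "text": "2020"},
    {"x": 10, "y": 100, "text": "Label 1"},
    {"x": 250, "y": 100, "text": "50"},
    {"x": 500, "y": 100, "text": "197"},
]

table = build_table(items)
print(table["Label 1"]["2021"])  # 50
print(table["Label 1"]["2020"])  # 197
```

Order-based matching ignores the horizontal skew entirely, but it breaks as soon as a row has an empty cell, so it would need a fallback for incomplete rows.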
I know this will most likely be a multi-step solution, but I'm having a hard time structuring the data and finding the best way to do it. In particular, I can't find a way to use the row and column information simultaneously.
I tried combining the entire page into a single text and then using spaCy to extract data, but it generates documents based on each line and disregards the columns.
I also tried using Tabula to turn the data into a cleaner table, but because of the skewing the rows and columns don't always match: "2021" could end up in column B and its values in column C.
I also tried reversing the whole process and classifying whether a line is a certain label. I would train the network on a sentence like "A long Label with multiple words 90 120" with class "Long label", and then be able to pass in any line to find out whether it is in fact a "Long label" and proceed to extract the two values. I got decent results in a first attempt, but again there is no way to know whether "90" or "120" belongs to "2021".
What would be a good way to structure this data, and what kind of model would it fit?