Find value in roughly tabular structured textual data from PDF

Question

We receive PDF files that contain valuable data in a somewhat structured way. There are a lot of variations, but it roughly translates to a 3-column layout:

	2021	2020
Label 1	50	197
Label 2	100	100
A long Label with multiple words	90	120

We receive the data in JSON format with the positions of the text/values in a coordinate system, but we can restructure it in another form if needed. For example:

[
{ x:250, y:70, text: "2021" }
,{ x:500, y:70, text: "2020" }
,{ x:10, y:100, text: "Label 1" }
,{ x:250, y:100, text: "50" }
,{ x:500, y:100, text: "197" }
,{ x:10, y:120, text: "Label 2" }
...
]

This example has perfect coordinates, which would make it easier. Real data unfortunately does not. For example, the X for the 2021 header could be 200 while its value sits at 270, whereas for 2020 they could be 490 and 510. The X depends on the text width: with centered text, longer text has a lower X. I could calculate the center of each word, but even that is a little skewed.
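To make the skew concrete, here is a minimal sketch of one way to tolerate it: group the X positions by the gaps between them instead of expecting exact matches. The function name `cluster_positions` and the `max_gap` threshold of 100 are assumptions for illustration; a real threshold would have to be tuned to the page width and font size.

```python
def cluster_positions(values, max_gap):
    """Group 1-D positions into clusters.

    A new cluster starts whenever the gap to the previous
    (sorted) value is larger than max_gap.
    """
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close enough: same column
        else:
            clusters.append([v])     # large gap: start a new column
    return clusters


# X positions from the skewed example: labels at 10, the 2021 header
# and value at 200/270, the 2020 header and value at 490/510.
xs = [200, 270, 490, 510, 10, 10]
columns = cluster_positions(xs, max_gap=100)  # max_gap is an assumed threshold
print(columns)
```

With these numbers the 70-unit skew inside the 2021 column stays below the threshold, while the distance between columns does not, so three groups come out. This only works as long as the skew within a column is clearly smaller than the spacing between columns.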

The end result of our tool would be that we know the value for each label and year.

Label 1 value for 2021 = 50

Label 1 value for 2020 = 197
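Purely as an illustration of the target structure, a rough sketch of such a lookup could look like the following. It assumes the topmost row holds the year headers, the leftmost item of every other row is the label, and each value belongs to the header with the nearest X. The helper names `group_rows` and `build_lookup`, the tolerance `y_tol`, and the slightly skewed sample coordinates are all made up for this sketch.

```python
def group_rows(tokens, y_tol=5):
    """Group tokens into rows: tokens whose Y values are within y_tol
    of the previous token (sorted by Y) share a row."""
    rows = []
    for tok in sorted(tokens, key=lambda t: t["y"]):
        if rows and abs(tok["y"] - rows[-1][-1]["y"]) <= y_tol:
            rows[-1].append(tok)
        else:
            rows.append([tok])
    # Order each row left to right.
    return [sorted(row, key=lambda t: t["x"]) for row in rows]


def build_lookup(tokens, y_tol=5):
    """Map (label, year) -> value.

    Assumes the first row is the header row and the leftmost
    token of every other row is the label.
    """
    rows = group_rows(tokens, y_tol)
    header, body = rows[0], rows[1:]
    lookup = {}
    for row in body:
        label, cells = row[0], row[1:]
        for cell in cells:
            # Assign the value to the header with the closest X.
            nearest = min(header, key=lambda h: abs(h["x"] - cell["x"]))
            lookup[(label["text"], nearest["text"])] = cell["text"]
    return lookup


# Hypothetical tokens with slightly skewed coordinates.
tokens = [
    {"x": 200, "y": 70, "text": "2021"},
    {"x": 490, "y": 70, "text": "2020"},
    {"x": 10, "y": 100, "text": "Label 1"},
    {"x": 270, "y": 101, "text": "50"},
    {"x": 510, "y": 99, "text": "197"},
    {"x": 10, "y": 120, "text": "Label 2"},
    {"x": 265, "y": 121, "text": "100"},
    {"x": 505, "y": 120, "text": "100"},
]

lookup = build_lookup(tokens)
print(lookup[("Label 1", "2021")])
print(lookup[("Label 1", "2020")])
```

Nearest-header assignment is only one possible rule; it breaks down when the skew approaches half the distance between two columns.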

I know this will most likely be a multi-step solution, but I'm having a hard time structuring the data and finding the best way to do it. In particular, I can't find a way to use the row and column information simultaneously.

I tried combining the entire page into a single text and then using spaCy to extract data, but it generates documents based on each line and disregards the columns.

I also tried using Tabula to convert the data into a cleaner table, but because of the skewing the rows/columns don't always match: "2021" could end up in column B and its values in column C.

I also tried reversing the whole process to classify whether a line is a certain label. I would train the network on a sentence like "A long Label with multiple words 90 120" with class "Long label", and then be able to pass in any line to find out whether it is in fact a Long label and proceed to get the two values. I got decent results in a first attempt, but again there is no way to know whether "90" or "120" belongs to "2021".

What would be a good way to structure this data, and what kind of model would it fit?
