4

What are the Ubuntu options to create CSV tables from an image:

table of figures

Ideally, options would be simple and quick. Image is from Yield Curve article

Zanna
  • 70,465
gatorback
  • 5,785
  • 9
  • 40
  • 65
  • Hi, this question would be better asked on stackoverflow.com – NGRhodes Aug 14 '19 at 18:01
  • 1
    @NGRhodes Thanks for the comment. I had intended (and have now updated) the question to be constrained to Ubuntu solutions. The concern is that if posted elsewhere the solution may not work on Ubuntu – gatorback Aug 14 '19 at 18:45
  • 1
    You would need to OCR the image after choosing an OCR tool https://help.ubuntu.com/community/OCR , manually confirm the OCR output is correct before conversion of text to CSV, replacing tabs or multiple spaces with commas. – K7AAY Aug 14 '19 at 20:35
  • 1
    @K7AAY If you have a command line example, please consider posting it as formal answer. – gatorback Aug 15 '19 at 00:10
  • 1
    Command line answers are dependent on the OCR tool chosen. My experience with Linux OCR is lacking, so I just provided the outline in the comment, not wanting to lead a fellow Gator astray. You asked for options, so I provided a link to OCR options. I hope someone will take this cup from me and provide an answer well-seasoned by experience. – K7AAY Aug 15 '19 at 16:01

1 Answers1

3

I wrote a small Python script that can do what you want.

For the OCR part you would need tesseract. You can install it running:

sudo apt install tesseract-ocr

Then, run tesseract to create a txt file with the image-read data. I am naming this file tesseract_output (tesseract will add the .txt extension), you can name it as you wish.

tesseract /path/to/image /path/to/tesseract_output

Then copy and paste the following script and save it to a file ending with .py (for example script.py).

import csv

def split_list(l, n):
    """Split list l in size-n lists.
    Returns a list containing the size-n lists.
    """
    splitted = []

    for i in range(0, len(l), n):
        splitted.append(l[i:i + n])

    return splitted

######################   USER INPUT   ######################
input_file = '/path/to/tesseract_output.txt'
output_file = '/path/to/table.csv'
rows = 10
delimiter = ';'
############################################################

# read input file
with open(input_file, 'r') as f:
    data = f.readlines()

# remove trailing whitespace and newlines
data = list(map(lambda x: x.strip(), data))
# remove empty elements 
data = list(filter(None, data))
# split data to rows-sized lists
data = split_list(data, rows)
# shape data
data = list(map(list, zip(*data)))

# write to csv
with open(output_file, 'w', newline='\n') as f:
    wr = csv.writer(f, delimiter=delimiter)
    for i in range(len(data)):
        wr.writerow(data[i])

For the script to work, you have to enter the following in the USER INPUT section:

  • input_file: the complete path to the tesseract output.

  • output_file: the complete path to the final csv file.

  • rows: the number of table rows. In your example image it's 10.

  • delimiter: the delimiter to be used in the csv. Here I use ;. You can use any 1-character string you need.

Run the file:

python /path/to/your/script.py

You should now have a csv with the following contents:

Date;1Mo;2Mo;3Mo;6 Mo;4Yr;2Yr;3Yr;5Yr;TYr;10 Yr;20Yr;30 Yr
08/01/19;2.14;214;2.07;2.04;1.88;1.73;1.67;1.68;177;1.90;2.21;2.44
08/02/19;2.44;2.12;2.08;2.02;1.85;1.72;1.67;1.66;1.75;1.86;2.16;2.39
08/05/19;2.07;2.08;2.05;1.99;1.78;1.59;1.55;1.55;1.63;1.75;2.07;2.30
08/06/19;2.05;2.08;2.05;2.00;1.80;1.60;1.54;1.53;1.62;1.73;2.03;2.25
08/07/19;2.02;2.04;2.02;1.95;1.75;1.59;1.51;1.52;1.60;171;2.01;2.22
08/08/19;2.09;2.07;2.02;1.96;1.79;1.62;1.54;1.54;1.62;1.72;2.02;2.25
08/09/19;2.05;2.06;2.00;1.95;1.78;1.63;1.58;1.57;1.65;174;2.03;2.26
08/12/19;2.09;2.06;2.00;1.94;1.75;1.58;1.51;1.49;1.56;1.65;1.92;2.14
08/13/19;2.05;2.04;2.00;1.96;1.86;1.66;1.60;1.57;1.62;1.68;1.94;2.15

CAUTION

As you can see, the result is satisfactory, but depends on tesseract's output. It is almost certain that tesseract won't detect everything correctly, as you can easily see in the csv output. You will have to compare the results with the original image and fix them manually, either in the tesseract output or in the csv output at the end.

Also, in the script I am taking care of trailing whitespace and redundant newlines that tesseract spits out, which works fine for your example image. However, if a table cell was empty, it would be completely removed, effectively destroying the whole table structure. In this case, if I were you, I would edit the tesseract_output.txt file and manually change the empty cells to containing a -, so it wouldn't get deleted.