What are the Ubuntu options to create CSV tables from an image:
Ideally, options would be simple and quick. Image is from Yield Curve article
What are the Ubuntu options to create CSV tables from an image:
Ideally, options would be simple and quick. Image is from Yield Curve article
I wrote a small Python script that can do what you want.
For the OCR part you would need tesseract
. You can install it running:
sudo apt install tesseract-ocr
Then, run tesseract
to create a txt file with the image-read data. I am naming this file tesseract_output
(tesseract
will add the .txt
extension), you can name it as you wish.
tesseract /path/to/image /path/to/tesseract_output
Then copy and paste the following script and save it to a file ending with .py
(for example script.py
).
import csv
def split_list(l, n):
"""Split list l in size-n lists.
Returns a list containing the size-n lists.
"""
splitted = []
for i in range(0, len(l), n):
splitted.append(l[i:i + n])
return splitted
###################### USER INPUT ######################
input_file = '/path/to/tesseract_output.txt'
output_file = '/path/to/table.csv'
rows = 10
delimiter = ';'
############################################################
# read input file
with open(input_file, 'r') as f:
data = f.readlines()
# remove trailing whitespace and newlines
data = list(map(lambda x: x.strip(), data))
# remove empty elements
data = list(filter(None, data))
# split data to rows-sized lists
data = split_list(data, rows)
# shape data
data = list(map(list, zip(*data)))
# write to csv
with open(output_file, 'w', newline='\n') as f:
wr = csv.writer(f, delimiter=delimiter)
for i in range(len(data)):
wr.writerow(data[i])
For the script to work, you have to enter the following in the USER INPUT section:
input_file
: the complete path to the tesseract
output.
output_file
: the complete path to the final csv file.
rows
: the number of table rows. In your example image it's 10.
delimiter
: the delimiter to be used in the csv. Here I use ;
. You can use any 1-character string you need.
Run the file:
python /path/to/your/script.py
You should now have a csv with the following contents:
Date;1Mo;2Mo;3Mo;6 Mo;4Yr;2Yr;3Yr;5Yr;TYr;10 Yr;20Yr;30 Yr
08/01/19;2.14;214;2.07;2.04;1.88;1.73;1.67;1.68;177;1.90;2.21;2.44
08/02/19;2.44;2.12;2.08;2.02;1.85;1.72;1.67;1.66;1.75;1.86;2.16;2.39
08/05/19;2.07;2.08;2.05;1.99;1.78;1.59;1.55;1.55;1.63;1.75;2.07;2.30
08/06/19;2.05;2.08;2.05;2.00;1.80;1.60;1.54;1.53;1.62;1.73;2.03;2.25
08/07/19;2.02;2.04;2.02;1.95;1.75;1.59;1.51;1.52;1.60;171;2.01;2.22
08/08/19;2.09;2.07;2.02;1.96;1.79;1.62;1.54;1.54;1.62;1.72;2.02;2.25
08/09/19;2.05;2.06;2.00;1.95;1.78;1.63;1.58;1.57;1.65;174;2.03;2.26
08/12/19;2.09;2.06;2.00;1.94;1.75;1.58;1.51;1.49;1.56;1.65;1.92;2.14
08/13/19;2.05;2.04;2.00;1.96;1.86;1.66;1.60;1.57;1.62;1.68;1.94;2.15
CAUTION
As you can see, the result is satisfactory, but depends on tesseract
's output. It is almost certain that tesseract
won't detect everything correctly, as you can easily see in the csv output. You will have to compare the results with the original image and fix them manually, either in the tesseract output or in the csv output at the end.
Also, in the script I am taking care of trailing whitespace and redundant newlines that tesseract
spits out, which works fine for your example image. However, if a table cell was empty, it would be completely removed, effectively destroying the whole table structure. In this case, if I were you, I would edit the tesseract_output.txt
file and manually change the empty cells to containing a -
, so it wouldn't get deleted.