
I am using Google's OCR to extract text from images, like receipts and invoices.

What are examples of techniques used to make sense of the text? For example, I would like to extract the date, name of the business, address, total amount, etc.

Before marking this question as "too broad": if someone could direct me to the right set of machine learning algorithms the industry uses for this, that would be great.

Abhay Naik
  • This is an old question, but it's unclear (to me at least) whether you're asking for 1. OCR techniques or 2. techniques that "give a meaning" to the extracted text (which was extracted with some OCR technique). From the title, it seems that you're looking for algorithms that solve the OCR problem (so it should be 1), but from your comment below under one answer it seems that you don't just want to perform OCR but you somehow want to associate a meaning with the extracted text. So, can you please clarify exactly what you're asking here? – nbro Jan 08 '21 at 01:11
  • To make things even worse, you say that you're using Google's OCR, which makes me think that you're not looking for OCR techniques, because, otherwise, why would you care about specific OCR techniques if you already have one? You should probably have also linked to the tool/library you're referring to when you say "Google's OCR" – nbro Jan 08 '21 at 01:18

2 Answers


Last semester, my team and I did a project on OCR.
Note: I am assuming the images in your dataset have a white background with black (or some other dark) text on them.
These are the overview steps we followed:
[Flowchart: pre-processing → segmentation → feature extraction → classification]
Pre-processing includes grayscale conversion, noise reduction, binarization and skew detection.
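For illustration, this stage might look roughly like the following with OpenCV (the filter size, Otsu thresholding, and the deskew heuristic are illustrative choices, not necessarily the exact ones we used):

```python
import cv2
import numpy as np

def preprocess(path):
    # Grayscale conversion
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Noise reduction: a small median filter removes salt-and-pepper noise
    denoised = cv2.medianBlur(gray, 3)
    # Binarization: Otsu's method picks the threshold automatically;
    # THRESH_BINARY_INV makes the dark text the white foreground
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Skew detection: the angle of the minimum-area rectangle around
    # all foreground pixels approximates the page skew
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:          # map the angle into (-45, 45]
        angle -= 90
    # Rotate to correct the skew (minAreaRect's angle convention differs
    # across OpenCV versions, so the sign may need flipping)
    h, w = binary.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, M, (w, h))
```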
The next step was segmentation, which extracts the individual characters from the image. A histogram taken along the y-axis divided the image into lines; a histogram along the x-axis then divided each line into words and further into characters. At the end of this step, we used a Savitzky-Golay filter to smooth the curves of the histograms.
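A rough sketch of that segmentation idea (the smoothing window and the gap threshold are arbitrary values you would tune for your data):

```python
import numpy as np
from scipy.signal import savgol_filter

def split_on_gaps(binary, axis):
    """Split a binarized image (text pixels = 1) using its projection
    histogram: positions whose smoothed pixel count is near zero are
    treated as gaps between lines or characters."""
    profile = binary.sum(axis=axis).astype(float)
    # Savitzky-Golay filter smooths the histogram so small dips inside
    # a character are not mistaken for gaps
    smooth = savgol_filter(profile, window_length=11, polyorder=2)
    mask = smooth > 0.05 * smooth.max()
    segments, start = [], None
    for i, on in enumerate(mask):
        if on and start is None:
            start = i
        elif not on and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(mask)))
    return segments

# img is a binarized page, e.g. the output of the pre-processing stage.
# Lines come from the y-axis histogram (sum across each row, axis=1);
# each line is then cut into characters with the x-axis histogram.
lines = split_on_gaps(img, axis=1)
chars = [split_on_gaps(img[a:b, :], axis=0) for a, b in lines]
```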

The next step was feature extraction. This is the most important step: the accuracy of your system depends on how good your features are.
We used the following features (a rough code sketch follows the list):

  • Crossings: counting the number of transitions between foreground and background along a set of probe lines. We used two diagonal lines, two horizontal lines, and one vertical line; you can use any number you want.
  • Zoning: the whole character region is divided into 16 zones, and the density of each zone is measured.
  • Projection histogram: each character has a (nearly) unique vertical and horizontal histogram signature.
  • Other features include the number of endpoints in the character, the number of loops, and the horizontal/vertical line count.
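As a rough illustration, the first three features could be computed like this on a binarized 0/1 character image (the grid size, probe positions, and bin count are illustrative, not our exact settings):

```python
import numpy as np

def zoning_features(glyph, grid=4):
    """Zoning: split the character into grid x grid zones
    (16 for grid=4) and record the ink density of each zone."""
    h, w = glyph.shape
    return [glyph[i * h // grid:(i + 1) * h // grid,
                  j * w // grid:(j + 1) * w // grid].mean()
            for i in range(grid) for j in range(grid)]

def crossing_features(glyph):
    """Crossings: background-to-foreground transitions along two
    horizontal lines, one vertical line, and the two diagonals."""
    h, w = glyph.shape
    probes = [
        glyph[h // 3, :], glyph[2 * h // 3, :],  # two horizontal lines
        glyph[:, w // 2],                        # one vertical line
        np.diagonal(glyph),                      # main diagonal
        np.diagonal(np.fliplr(glyph)),           # anti-diagonal
    ]
    return [int(np.count_nonzero(np.diff(p.astype(np.int8)) == 1))
            for p in probes]

def projection_features(glyph, bins=8):
    """Projection histograms: vertical and horizontal ink profiles,
    resampled to a fixed number of bins so characters of different
    sizes stay comparable."""
    resample = lambda p: [c.mean() for c in np.array_split(p, bins)]
    return (resample(glyph.sum(axis=0).astype(float)) +
            resample(glyph.sum(axis=1).astype(float)))
```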

We used three different classification algorithms for our project: k-nearest neighbours (KNN), an artificial neural network (ANN), and an Extra Trees classifier. Their F1 scores were 0.84, 0.82, and 0.77, respectively.
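With scikit-learn, the comparison looks roughly like this (X and y are assumed to be the feature vectors and character labels from the previous steps; the hyperparameters here are guesses, not our project's settings):

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# X: one feature vector per segmented character, y: the character labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "ANN": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
    "Extra Trees": ExtraTreesClassifier(n_estimators=200),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Macro-averaged F1 weights every character class equally
    print(name, f1_score(y_test, model.predict(X_test), average="macro"))
```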
For training, you will need to find datasets. Many OCR datasets are available online; make sure you are using good ones.

Ugnes

An interesting question. I think the algorithms used for OCR are logistic regression or decision trees, applied in multiple steps.

The steps can be:

  1. Image Classification - In this step, the images are classified as containing text or not.
  2. Text Detection - In this step, the images with text are divided into blocks, and each block is classified as containing text or not (a rough sketch of this step follows the list).
  3. Character Detection - In this step, the blocks with text are divided into smaller boxes of single characters, which are compared against a database of known characters.
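For illustration, step 2 might be implemented as a sliding-window block classifier like the sketch below (the window size and stride are arbitrary, and X_blocks_train / y_blocks_train stand in for a hypothetical labelled set of blocks):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def blocks(image, size=32, stride=16):
    """Slide a window over the page and yield each block's position
    together with its flattened pixel values."""
    h, w = image.shape
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            yield (y, x), image[y:y + size, x:x + size].ravel()

# Train a binary "text vs. no text" classifier on labelled blocks
# (X_blocks_train / y_blocks_train are a hypothetical training set)
detector = LogisticRegression(max_iter=1000)
detector.fit(X_blocks_train, y_blocks_train)

# Keep only the blocks the classifier believes contain text;
# page is assumed to be a grayscale image as a 2-D NumPy array
text_blocks = [pos for pos, feat in blocks(page)
               if detector.predict([feat])[0] == 1]
```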

The database is built using crowdsourced "captcha" projects, such as reCAPTCHA.