1

I have multiple pictures that look exactly like the one below this text. I'm trying to train a CNN to read the digits for me. The problem is isolating the digits: they could be written in any shape, style, and position the writer wanted. I thought of training another CNN to recognize the position of the digits, but I'm not sure how to approach the problem. I also need to get rid of that string and underline. Any clue would be appreciated. By the way, I would love to get the digits in the 28x28 format, just like the ones in MNIST.

Thanks in advance.

enter image description here

Igor

2 Answers

1

I think one approach you can try for segmenting the digits is Connected Components Labeling (https://en.wikipedia.org/wiki/Connected-component_labeling). With it, you'll end up with a label for each character, and then you can find the convex hull of each one. After that, just crop a square around each convex hull and feed it to your CNN. Note that this will only work if there is at least one background pixel between the characters...
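A minimal sketch of that idea, assuming the image has already been thresholded to a boolean ink mask. It uses only NumPy, with a hand-written flood fill standing in for a library labeling routine (`scipy.ndimage.label` or OpenCV's `cv2.connectedComponents` would be the usual choices), and it crops by bounding box rather than convex hull, since a square crop only needs the extents. The names `label_components` and `crop_square` are made up for illustration.

```python
from collections import deque

import numpy as np


def label_components(binary):
    """Label 8-connected foreground regions with a breadth-first flood fill."""
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=np.int32)
    count = 0
    for r in range(h):
        for c in range(w):
            if not binary[r, c] or labels[r, c]:
                continue
            count += 1
            labels[r, c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < h and 0 <= nx < w
                        if inside and binary[ny, nx] and not labels[ny, nx]:
                            labels[ny, nx] = count
                            queue.append((ny, nx))
    return labels, count


def crop_square(labels, label, size=28):
    """Center one component on a square canvas, then resize it to size x size."""
    ys, xs = np.nonzero(labels == label)
    top, left = ys.min(), xs.min()
    patch = labels[top:ys.max() + 1, left:xs.max() + 1] == label
    side = max(patch.shape)
    canvas = np.zeros((side, side), dtype=np.uint8)
    oy = (side - patch.shape[0]) // 2
    ox = (side - patch.shape[1]) // 2
    canvas[oy:oy + patch.shape[0], ox:ox + patch.shape[1]] = patch
    # Nearest-neighbour resampling; a real pipeline would likely interpolate.
    idx = np.arange(size) * side // size
    return canvas[np.ix_(idx, idx)]


# Toy input: two strokes separated by background columns.
image = np.zeros((10, 20), dtype=bool)
image[2:8, 2:5] = True
image[1:9, 10:16] = True

labels, count = label_components(image)
digits = [crop_square(labels, k) for k in range(1, count + 1)]
```

Each entry of `digits` is a 28x28 array that could be fed to an MNIST-style network. Touching characters end up in one component, and a stroke that crosses the underline gets merged with it, so this only covers the well-separated case.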

  • @Júlio César Batista Isn't that just looping over `contours` in OpenCV and removing them? Only this time I have some labels thanks to the CCL. I really don't think it will work. What if there is some other element in front of that string, and it is big enough that I can't close it with `dilation` or `erosion` without removing the parts that I care about? – Igor Dec 17 '18 at 11:57
0

Using CNNs to recognize digits is a reasonable approach as of this writing. According to Shuo Yang et al. in Deep Representation Learning with Target Coding, its effectiveness can be enhanced via Sensitive Error Correcting Output Codes (John Langford and Alina Beygelzimer, 2005).

Given sufficient depth, a CNN can be trained to recognize the digits with the field name and underscore intact. Automatically removing them would generally only be possible after they are recognized using the CNN approach anyway.

Isolating the digits from the field name and the underscore would be accomplished via the same approach used to isolate the digits from one another. There is no reason to perform these two conceptual tasks in series and consume additional development time and resources, when a single deep CNN can locate the digits and dismiss the field name and underscore. This is similar to animal vision. A mosquito's visual pathway distinguishes an oncoming object from the background to avoid a swat using the same network it uses to recognize the object.

It would help to normalize the input by finding a two-dimensional plateau (much like a geological one) in the distribution of black pixels along the horizontal and vertical dimensions. Summing rows and columns of pixels and using a heuristic to find a rectangle outside of which there is only noise may be sufficient. Then trim and scale. This removes redundancy and may improve CNN training speed, accuracy, and/or reliability by spreading the image more widely over the input layer, likely reducing the layer count by at least one.
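One way the row and column sums could be turned into a trimming rectangle is sketched below. The heuristic (keep everything between the first and last row or column whose ink count reaches a fixed fraction of the image extent) and the names `trim_to_ink`, `ink_threshold`, and `min_fraction` are assumptions for illustration, not something prescribed above; scaling the trimmed result is left out.

```python
import numpy as np


def trim_to_ink(gray, ink_threshold=128, min_fraction=0.05):
    """Crop a grayscale image to the rectangle holding most of its dark pixels."""
    ink = gray < ink_threshold      # dark pixels count as ink
    row_sums = ink.sum(axis=1)      # ink pixels per row
    col_sums = ink.sum(axis=0)      # ink pixels per column

    # A row (column) is kept only if its ink count reaches a fraction of the
    # image width (height); isolated specks fall below that and are ignored.
    rows = np.nonzero(row_sums >= min_fraction * ink.shape[1])[0]
    cols = np.nonzero(col_sums >= min_fraction * ink.shape[0])[0]
    if rows.size == 0 or cols.size == 0:
        return gray                 # nothing dense enough to trim around

    return gray[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


# Toy page: white background, one dense block of ink, one noise speck.
page = np.full((40, 60), 255, dtype=np.uint8)
page[10:30, 15:45] = 0
page[2, 3] = 0

trimmed = trim_to_ink(page)
```

On this toy page the speck at the corner is ignored and `trimmed` is the 20x30 block. Real scans would probably need the two thresholds tuned, and a long underline can easily pass the row test, so it may end up inside the rectangle.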

Another approach is to use overall features of the form in which the field resides to rotate, position, and scale the image so that the field name and underscore are within a fraction of a pixel of an expected location on the form. In this case, training a separate CNN is less redundant and has a higher return on development investment. In such a design, the field name and underscore can be removed by subtraction; however, this may disturb field value recognition because the handwriting may overlap the items that are blanked. It will require experimentation to determine the loss of accuracy and reliability caused by such disturbance.
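The subtraction step, and the overlap problem it creates, can be illustrated with boolean ink masks. This sketch assumes the alignment has already been done and that a mask of the blank form (`template_ink`, a hypothetical name) is available; `subtract_template` is likewise made up for the example.

```python
import numpy as np


def subtract_template(ink, template_ink):
    """Clear every ink pixel that is also ink on the blank form.

    Returns the cleaned mask and the number of pixels that were cleared.
    """
    cleaned = ink & ~template_ink
    removed = int((ink & template_ink).sum())
    return cleaned, removed


# Blank form: a single underline on row 15.
template_ink = np.zeros((20, 40), dtype=bool)
template_ink[15, 5:35] = True

# Filled form: the same underline plus a vertical stroke that crosses it.
ink = template_ink.copy()
ink[5:18, 20] = True

cleaned, removed = subtract_template(ink, template_ink)
```

The underline is gone from `cleaned`, but so is the stroke pixel at row 15, column 20, which leaves a one-pixel gap in the handwriting. That gap is the kind of disturbance the experiments would need to quantify.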

Addendum in Response to Comment

For the first approach, a Field Name and Underscore Superimposer will need to be designed and implemented to transform the training example set features, the pixel arrays, but preserve the labels corresponding to each example. The hyper-parameter values that work best without the superimposition may need modification to work best with it.
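A rough sketch of what such a superimposer might look like, assuming MNIST-style arrays (bright ink on a dark background) and a single fixed template image of the field name and underscore. The function name `superimpose`, the random placement, and the toy data are all assumptions made for the example.

```python
import numpy as np


def superimpose(digits, labels, template, rng):
    """Paste each digit onto a copy of the template; labels pass through unchanged."""
    n, dh, dw = digits.shape
    th, tw = template.shape
    out = np.empty((n, th, tw), dtype=template.dtype)
    for i in range(n):
        # Random top-left corner such that the digit fits inside the template.
        y = rng.integers(0, th - dh + 1)
        x = rng.integers(0, tw - dw + 1)
        canvas = template.copy()
        region = canvas[y:y + dh, x:x + dw]
        # Bright ink on dark background: the maximum keeps both layers visible.
        canvas[y:y + dh, x:x + dw] = np.maximum(region, digits[i])
        out[i] = canvas
    return out, labels


rng = np.random.default_rng(0)

# Toy template: dark canvas with an underline on row 34.
template = np.zeros((40, 96), dtype=np.uint8)
template[34, 8:88] = 255

# Stand-in for real training digits and their labels.
digits = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
labels = np.array([3, 1, 4, 1])

features, new_labels = superimpose(digits, labels, template, rng)
```

The transformed `features` are larger than the original 28x28 inputs, so the network's input layer would have to change accordingly, which is one reason the hyper-parameters may need retuning.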

Douglas Daseeco
  • What would be your preferred way of solving the problem? I'm interested in your first approach. So, if I build a network big enough, it could treat the field name and underscore as some kind of constant that would not affect the result? In general: one bounding box that holds the field name, underscore, and number at one scale, normalized and used as input for training. I'm not sure whether this approach would require more training examples, and what else would change. But it sounds great. By the way, `labels` would still be only the exact number values? – Igor Jan 08 '19 at 11:34
  • I guess that by `Field Name and Underscore Superimposer will need to be designed and implemented to transform the training example set features, the pixel arrays, but preserve the labels` you really mean: the data set will have different features because the images would include the superimposed field name and underscore, but the labels would of course stay the same. On the other hand, the new feature of the data set would not have such a great impact on the hyper-parameters because, basically, every picture would have the same new feature. – Igor Jan 08 '19 at 12:25
  • Also, if a model is trained on 28x28 images, wouldn't that approach have its pitfalls, since scaling to 28x28 would make the image less readable? Would the dimensions need to go up? – Igor Jan 08 '19 at 12:25