Using a CNN to recognize the digits is a reasonable approach as of this writing. Its effectiveness can be enhanced via Sensitive Error Correcting Output Codes (John Langford and Alina Beygelzimer, 2005), according to Shuo Yang et al. in Deep Representation Learning with Target Coding.
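To make the idea of error-correcting target codes concrete, here is a minimal sketch in which each digit class is assigned a binary codeword instead of a one-hot vector, and a noisy network output is decoded to the nearest codeword. The Hadamard-based codebook, the code length of 16, and the function names are assumptions chosen for illustration; they are not taken from either cited paper.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = np.array([[1]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def make_codebook(num_classes=10, code_len=16):
    """One binary codeword per class (hypothetical helper).

    The all-ones first row and first column are dropped, and {-1, +1}
    is mapped to {0, 1}, leaving codewords of length code_len - 1.
    """
    h = hadamard(code_len)
    return ((h[1:num_classes + 1, 1:] + 1) // 2).astype(np.float32)

def decode(outputs, codebook):
    """Map each output vector to the class whose codeword is nearest."""
    dist = ((outputs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

codebook = make_codebook()
noisy = codebook[4].copy()
noisy[[0, 5, 9]] = 1.0 - noisy[[0, 5, 9]]   # simulate three wrong output bits
predicted = decode(noisy[None, :], codebook)
```

In this sketch the CNN's output layer would have 15 sigmoid units trained against the codeword of the true digit. Because any two codewords here differ in 8 positions, a few wrong output bits still decode to the correct class.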
A sufficiently deep CNN can be trained to recognize the digits with the field name and underscore left intact. Removing them automatically would generally be possible only after they have been recognized, which requires the CNN approach anyway.
Isolating the digits from the field name and the underscore would be accomplished with the same approach used to isolate the digits from one another. There is no reason to perform these two conceptual tasks in series, consuming additional development time and resources, when a single deep CNN can locate the digits and dismiss the field name and underscore. This is similar to animal vision: a mosquito's visual pathway distinguishes an oncoming object from the background to avoid a swat using the same network it uses to recognize the object.
What would help is to normalize the input by finding a two-dimensional plateau (much like a geological one) in the distribution of black pixels along the horizontal and vertical dimensions. Summing the rows and columns of pixels and using a heuristic to find a rectangle outside of which there is only noise may be sufficient. Then trim to that rectangle and scale it to the input size. This removes redundancy and may improve CNN training speed, accuracy, and reliability by spreading the image more widely over the input layer, possibly reducing the required layer count by at least one.
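The trim-and-scale step can be sketched with row and column sums as follows. This is a minimal illustration, assuming a grayscale array in which 0 is black ink and 255 is white paper; the function names, the ink threshold, and the rule that a row or column counts as signal when it holds more than 20% of the peak ink count are all assumptions that would need tuning on real forms.

```python
import numpy as np

def crop_to_ink(image, ink_threshold=128, min_fraction=0.2):
    """Trim a grayscale image to the rectangle holding most of the ink."""
    ink = image < ink_threshold              # True where a pixel is dark
    row_sums = ink.sum(axis=1)               # ink count per row
    col_sums = ink.sum(axis=0)               # ink count per column
    # Heuristic: rows/columns far below the peak count are treated as noise.
    rows = np.flatnonzero(row_sums > min_fraction * row_sums.max())
    cols = np.flatnonzero(col_sums > min_fraction * col_sums.max())
    if rows.size == 0 or cols.size == 0:
        return image                         # no ink found; leave unchanged
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

def resize_nearest(image, out_h, out_w):
    """Scale to (out_h, out_w) by nearest-neighbour sampling."""
    r = (np.arange(out_h) * image.shape[0]) // out_h
    c = (np.arange(out_w) * image.shape[1]) // out_w
    return image[r][:, c]

# Synthetic example: a dark block plus one stray noise pixel.
page = np.full((20, 30), 255, dtype=np.uint8)
page[5:10, 10:20] = 0
page[0, 0] = 0
cropped = crop_to_ink(page)
scaled = resize_nearest(cropped, 28, 28)
```

A fixed fraction of the peak is the simplest possible heuristic. Thin strokes at the edge of the handwriting could fall below it, so a real implementation might pad the rectangle or smooth the profiles first.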
Another approach is to use overall features of the form in which the field resides to rotate, position, and scale the image so that the field name and underscore are within a fraction of a pixel of their expected location on the form. In this case, training a separate CNN is less redundant and has a higher return on development investment. In such a design, the field name and underscore can be removed by subtraction. However, this may disturb recognition of the field value, because the handwriting may overlap the items that are blanked. Experimentation will be required to determine how much accuracy and reliability are lost to such disturbance.
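Removal by subtraction, and the disturbance it can cause, can be sketched as follows. This assumes the image has already been registered to the form, that 0 is black ink and 255 is white paper, and that a boolean mask of the printed field name and underscore is available; the function name and the toy geometry are hypothetical.

```python
import numpy as np

def remove_template(field, template_ink):
    """Blank every pixel covered by the printed template.

    field: registered grayscale image of the field.
    template_ink: boolean mask, True where the field name or underscore is printed.
    """
    cleaned = field.copy()
    cleaned[template_ink] = 255
    return cleaned

# Toy field: an underscore on row 8 and a handwritten stroke that crosses it.
field = np.full((12, 10), 255, dtype=np.uint8)
template_ink = np.zeros((12, 10), dtype=bool)
template_ink[8, 1:9] = True
field[template_ink] = 0          # printed underscore
field[2:10, 3] = 0               # handwritten vertical stroke, rows 2 to 9

cleaned = remove_template(field, template_ink)
```

In the result, the stroke survives above and below row 8 but is broken where it crossed the underscore. That gap is the kind of disturbance whose effect on recognition would have to be measured.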
Addendum in Response to Comment
For the first approach, a Field Name and Underscore Superimposer will need to be designed and implemented. It transforms the features of the training examples (the pixel arrays) while preserving the label that corresponds to each example. The hyper-parameter values that work best without the superimposition may need to be adjusted to work best with it.
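A minimal sketch of such a superimposer follows. It assumes grayscale arrays in which 0 is black ink and 255 is white paper, a template image that already contains the printed field name and underscore, and a fixed paste position; the function names and the toy data are hypothetical.

```python
import numpy as np

def superimpose(digits, template, top, left):
    """Paste a digit image onto a copy of the template, keeping the darker pixel."""
    canvas = template.copy()
    h, w = digits.shape
    region = canvas[top:top + h, left:left + w]
    canvas[top:top + h, left:left + w] = np.minimum(region, digits)
    return canvas

def superimpose_dataset(images, labels, template, top, left):
    """Transform every pixel array; the labels pass through unchanged."""
    transformed = np.stack(
        [superimpose(img, template, top, left) for img in images]
    )
    return transformed, labels

# Toy template: white canvas with an underscore on row 10.
template = np.full((12, 20), 255, dtype=np.uint8)
template[10, 2:18] = 0

# Two toy 8x8 "digit" images with a single dark pixel each.
images = np.full((2, 8, 8), 255, dtype=np.uint8)
images[0, 1, 1] = 0
images[1, 6, 4] = 0
labels = np.array([3, 7])

new_images, new_labels = superimpose_dataset(images, labels, template, top=2, left=6)
```

Taking the darker of the two pixels mimics ink written over print. Varying `top` and `left` per example would be one way to simulate handwriting that drifts relative to the underscore.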