I am currently training a bi-LSTM model that recognizes an individual's handwriting. I am stuck at a minimum loss of 1.2, and I don't think the problem is the model itself, because I copied the architecture from a study that also trains on the IAM dataset, which is the same dataset I am using.
The way I am looking at this problem is that maybe my image preparation does not match the study's. I prepare each image by resizing the word while preserving its aspect ratio, then pasting it centered onto a blank black canvas, which normalizes all my images to a single fixed size.
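A minimal sketch of that preparation step, assuming Pillow and the 360x60 target my model below expects (the final transpose puts the width axis first, matching my input layer):

```python
import numpy as np
from PIL import Image

TARGET_W, TARGET_H = 360, 60  # matches the model's (360, 60, 1) input

def prepare_image(path):
    """Resize a word image preserving aspect ratio, then center it
    on a blank black canvas of fixed size."""
    img = Image.open(path).convert("L")
    scale = min(TARGET_W / img.width, TARGET_H / img.height)
    new_w, new_h = int(img.width * scale), int(img.height * scale)
    img = img.resize((new_w, new_h), Image.BILINEAR)

    canvas = Image.new("L", (TARGET_W, TARGET_H), color=0)  # black canvas
    canvas.paste(img, ((TARGET_W - new_w) // 2, (TARGET_H - new_h) // 2))

    arr = np.asarray(canvas, dtype=np.float32) / 255.0
    return np.expand_dims(arr.T, axis=-1)  # shape (360, 60, 1)
```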
As stated in the study:

> The built network takes in a variable width image as an input, where the length of the image is sixty pixels
So I think I am doing something wrong here, or I am just misunderstanding something.
Is it possible that this model actually accepts variable-width images, or is it just doing what I did (padding everything to one fixed size)?
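If it does take variable widths, I imagine the input would be declared with a `None` width, roughly like this guess (not the study's actual code; the layer sizes are placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A guess at a variable-width setup (not the study's actual code):
# width (the time axis) is None, height is fixed at 60 pixels.
inputs = keras.Input(shape=(None, 60, 1), name="image")
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D((2, 2))(x)  # halves both width and height
# ... further conv/pool blocks would go here ...
# Collapse height x channels into features; -1 keeps the width variable.
x = layers.Reshape((-1, 30 * 64))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
outputs = layers.Dense(85, activation="softmax")(x)
model = keras.Model(inputs, outputs, name="variable_width_guess")
```

As far as I understand, CTC itself is fine with a variable time dimension, so the only constraint would be that the downsampled width stays at least as long as the label, with batches padded per batch.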
Also, the study reached a loss of under 1 in just 5 epochs, which I can hardly match with mine.
And is the image's scale a factor here? If so, can I improve things by introducing data augmentation into my dataset?
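For context, the kind of augmentation I had in mind is mild geometric jitter, e.g. (my own idea, the study does not mention augmentation; these Keras preprocessing layers expect plain height x width images, so this would run before my transpose step):

```python
import tensorflow as tf
from tensorflow.keras import layers

# My own augmentation idea, not from the study: small zoom, shift and
# rotation jitter, filling exposed pixels with black to match my canvas.
augment = tf.keras.Sequential([
    layers.RandomZoom(height_factor=(-0.1, 0.1), fill_mode="constant"),
    layers.RandomTranslation(0.05, 0.05, fill_mode="constant"),
    layers.RandomRotation(0.02, fill_mode="constant"),  # ~ +/- 7 degrees
])

# Training-time only, on (batch, height, width, 1) tensors:
# images = augment(images, training=True)
```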
A sample of the dataset:
Below is the summary of my model, built in Keras:
Model: "handwriting_recognizer"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
image (InputLayer) [(None, 360, 60, 1) 0 []
]
Conv1 (Conv2D) (None, 360, 60, 64) 1664 ['image[0][0]']
batchnorm1 (BatchNormalization (None, 360, 60, 64) 256 ['Conv1[0][0]']
)
pool1 (MaxPooling2D) (None, 180, 30, 64) 0 ['batchnorm1[0][0]']
Conv2 (Conv2D) (None, 180, 30, 64) 36928 ['pool1[0][0]']
batchnorm2 (BatchNormalization (None, 180, 30, 64) 256 ['Conv2[0][0]']
)
pool2 (MaxPooling2D) (None, 90, 15, 64) 0 ['batchnorm2[0][0]']
Conv3 (Conv2D) (None, 90, 15, 256) 147712 ['pool2[0][0]']
Conv4 (Conv2D) (None, 90, 15, 256) 590080 ['Conv3[0][0]']
batchnorm3 (BatchNormalization (None, 90, 15, 256) 1024 ['Conv4[0][0]']
)
pool3 (MaxPooling2D) (None, 45, 7, 256) 0 ['batchnorm3[0][0]']
Conv5 (Conv2D) (None, 45, 7, 512) 1180160 ['pool3[0][0]']
Conv6 (Conv2D) (None, 45, 7, 512) 2359808 ['Conv5[0][0]']
batchnorm4 (BatchNormalization (None, 45, 7, 512) 2048 ['Conv6[0][0]']
)
pool4 (MaxPooling2D) (None, 22, 3, 512) 0 ['batchnorm4[0][0]']
reshape (Reshape) (None, 22, 1536) 0 ['pool4[0][0]']
dense1 (Dense) (None, 22, 64) 98368 ['reshape[0][0]']
dropout (Dropout) (None, 22, 64) 0 ['dense1[0][0]']
bidirectional (Bidirectional) (None, 22, 256) 197632 ['dropout[0][0]']
bidirectional_1 (Bidirectional (None, 22, 256) 394240 ['bidirectional[0][0]']
)
bidirectional_2 (Bidirectional (None, 22, 128) 164352 ['bidirectional_1[0][0]']
)
dropout_1 (Dropout) (None, 22, 128) 0 ['bidirectional_2[0][0]']
label (InputLayer) [(None, None)] 0 []
dense2 (Dense) (None, 22, 85) 10965 ['dropout_1[0][0]']
ctc_loss (CTCLayer) (None, 22, 85) 0 ['label[0][0]',
'dense2[0][0]']
==================================================================================================
Total params: 5,185,493
Trainable params: 5,183,701
Non-trainable params: 1,792
__________________________________________________________________________________________________