I am currently training a bi-LSTM model that recognizes an individual's handwriting. I am stuck at a minimum loss of 1.2, and I don't think the problem is the model itself, because I copied the architecture from a study that also trains on the IAM dataset, which is the same dataset I am using.
The way I am looking at this problem is that maybe my image preparation does not match the study's. I prepare each image by resizing the word while preserving its aspect ratio, then pasting it centered onto a blank black canvas, which normalizes all my images to a single fixed size.
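A minimal sketch of that preparation step, assuming Pillow and the 360x60 target my model below expects (the final transpose puts the width axis first, matching my input layer):

```python
import numpy as np
from PIL import Image

TARGET_W, TARGET_H = 360, 60  # matches the model's (360, 60, 1) input

def prepare_image(path):
    """Resize a word image preserving aspect ratio, then center it
    on a blank black canvas of fixed size."""
    img = Image.open(path).convert("L")
    scale = min(TARGET_W / img.width, TARGET_H / img.height)
    new_w, new_h = int(img.width * scale), int(img.height * scale)
    img = img.resize((new_w, new_h), Image.BILINEAR)

    canvas = Image.new("L", (TARGET_W, TARGET_H), color=0)  # black canvas
    canvas.paste(img, ((TARGET_W - new_w) // 2, (TARGET_H - new_h) // 2))

    arr = np.asarray(canvas, dtype=np.float32) / 255.0
    return np.expand_dims(arr.T, axis=-1)  # shape (360, 60, 1)
```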
As stated in the study:

> The built network takes in a variable width image as an input, where the length of the image is sixty pixels
So I think I am doing something wrong here, or I am just misunderstanding something.
Is it possible that this model actually accepts variable-width images, or is it just doing what I did (padding everything to one fixed size)?
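If it does take variable widths, I imagine the input would be declared with a `None` width, roughly like this guess (not the study's actual code; the layer sizes are placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A guess at a variable-width setup (not the study's actual code):
# width (the time axis) is None, height is fixed at 60 pixels.
inputs = keras.Input(shape=(None, 60, 1), name="image")
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D((2, 2))(x)  # halves both width and height
# ... further conv/pool blocks would go here ...
# Collapse height x channels into features; -1 keeps the width variable.
x = layers.Reshape((-1, 30 * 64))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
outputs = layers.Dense(85, activation="softmax")(x)
model = keras.Model(inputs, outputs, name="variable_width_guess")
```

As far as I understand, CTC itself is fine with a variable time dimension, so the only constraint would be that the downsampled width stays at least as long as the label, with batches padded per batch.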
Also, the study reached a loss of under 1 in just 5 epochs, which I can hardly match with mine.
And is the image's scale a factor here? If so, can I improve things by introducing data augmentation into my dataset?
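For context, the kind of augmentation I had in mind is mild geometric jitter, e.g. (my own idea, the study does not mention augmentation; these Keras preprocessing layers expect plain height x width images, so this would run before my transpose step):

```python
import tensorflow as tf
from tensorflow.keras import layers

# My own augmentation idea, not from the study: small zoom, shift and
# rotation jitter, filling exposed pixels with black to match my canvas.
augment = tf.keras.Sequential([
    layers.RandomZoom(height_factor=(-0.1, 0.1), fill_mode="constant"),
    layers.RandomTranslation(0.05, 0.05, fill_mode="constant"),
    layers.RandomRotation(0.02, fill_mode="constant"),  # ~ +/- 7 degrees
])

# Training-time only, on (batch, height, width, 1) tensors:
# images = augment(images, training=True)
```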
A sample of the dataset:
Below is the summary of my model, built in Keras:
Model: "handwriting_recognizer"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
image (InputLayer) [(None, 360, 60, 1) 0 []
]
Conv1 (Conv2D) (None, 360, 60, 64) 1664 ['image[0][0]']
batchnorm1 (BatchNormalization (None, 360, 60, 64) 256 ['Conv1[0][0]']
)
pool1 (MaxPooling2D) (None, 180, 30, 64) 0 ['batchnorm1[0][0]']
Conv2 (Conv2D) (None, 180, 30, 64) 36928 ['pool1[0][0]']
batchnorm2 (BatchNormalization (None, 180, 30, 64) 256 ['Conv2[0][0]']
)
pool2 (MaxPooling2D) (None, 90, 15, 64) 0 ['batchnorm2[0][0]']
Conv3 (Conv2D) (None, 90, 15, 256) 147712 ['pool2[0][0]']
Conv4 (Conv2D) (None, 90, 15, 256) 590080 ['Conv3[0][0]']
batchnorm3 (BatchNormalization (None, 90, 15, 256) 1024 ['Conv4[0][0]']
)
pool3 (MaxPooling2D) (None, 45, 7, 256) 0 ['batchnorm3[0][0]']
Conv5 (Conv2D) (None, 45, 7, 512) 1180160 ['pool3[0][0]']
Conv6 (Conv2D) (None, 45, 7, 512) 2359808 ['Conv5[0][0]']
batchnorm4 (BatchNormalization (None, 45, 7, 512) 2048 ['Conv6[0][0]']
)
pool4 (MaxPooling2D) (None, 22, 3, 512) 0 ['batchnorm4[0][0]']
reshape (Reshape) (None, 22, 1536) 0 ['pool4[0][0]']
dense1 (Dense) (None, 22, 64) 98368 ['reshape[0][0]']
dropout (Dropout) (None, 22, 64) 0 ['dense1[0][0]']
bidirectional (Bidirectional) (None, 22, 256) 197632 ['dropout[0][0]']
bidirectional_1 (Bidirectional (None, 22, 256) 394240 ['bidirectional[0][0]']
)
bidirectional_2 (Bidirectional (None, 22, 128) 164352 ['bidirectional_1[0][0]']
)
dropout_1 (Dropout) (None, 22, 128) 0 ['bidirectional_2[0][0]']
label (InputLayer) [(None, None)] 0 []
dense2 (Dense) (None, 22, 85) 10965 ['dropout_1[0][0]']
ctc_loss (CTCLayer) (None, 22, 85) 0 ['label[0][0]',
'dense2[0][0]']
==================================================================================================
Total params: 5,185,493
Trainable params: 5,183,701
Non-trainable params: 1,792
__________________________________________________________________________________________________