
I am currently experimenting with U-Net. I am doing semantic segmentation on the 2018 Data Science Bowl dataset from Kaggle, without any data augmentation.

In my experiments, I am trying different hyper-parameters, like using Adam, mini-batch GD (MBGD), and batch normalization. Interestingly, all models with BN and/or Adam improve, while models without BN and with MBGD do not.

How could this be explained? If it is due to the internal covariate shift, the Adam models without BN should not improve either, right?

The image below shows the binary cross-entropy (BCE) training loss of my three models: the basic U-Net (blue), the basic U-Net with BN after every convolution (green), and the basic U-Net with Adam instead of MBGD (orange). The learning rate in all models is 0.0001; I have also tried other learning rates, with worse results.

[Image: training loss (BCE) of the basic U-Net (blue), the basic U-Net with BN (green), and the basic U-Net with Adam (orange)]


1 Answer


Well, some time ago I faced the same issue in a semantic segmentation task. Batch normalization is expected to improve convergence, because normalizing the activations prevents the gradient magnitudes from exploding and leads to steadier convergence.
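As a rough illustration of what BN does to the activations (a minimal NumPy sketch, not the framework implementation; the function name and shapes are mine), each feature is standardized over the mini-batch and then scaled and shifted by learned parameters:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch-normalize x of shape (batch, features), per feature."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # learned scale and shift

# activations with a large, shifted magnitude, as might come out of a conv layer
x = np.random.randn(8, 4) * 10 + 3
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

Whatever scale the incoming activations have, `y` ends up with (roughly) zero mean and unit variance per feature, which keeps the gradients flowing through subsequent layers in a well-behaved range.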

Adam is an adaptive optimizer that combines momentum with division by a running (exponentially weighted) average of the squared gradients from previous iterations: https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c.
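A single Adam parameter update can be sketched as follows (plain NumPy, default hyper-parameters from the Adam paper; the function name is mine). The key point is that each parameter gets its own effective step size, damped in steep directions and enlarged in flat ones:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad       # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # EMA of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction (m, v start at 0)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v

w = np.array([1.0, -2.0])
g = np.array([0.5, -0.5])
w_new, m, v = adam_step(w, g, m=np.zeros(2), v=np.zeros(2), t=1)
```

Note that on the very first step the bias-corrected update reduces to roughly `lr * sign(grad)`: the step size is bounded by the learning rate regardless of how steep the surface is, which is exactly what tames the "big leaps" of plain SGD on a rough loss surface.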

The loss surfaces of neural networks are a difficult and, at present, poorly understood topic. I suppose the poor convergence of SGD is caused by the roughness of the loss surface, where the gradient makes big leaps and jumps over the minima. The adaptive learning strategy of Adam, on the other hand, allows it to descend into the valleys.