
I have spent some time searching Google but wasn't able to find out which optimization algorithm works best for binary classification when the images are similar to one another.

I'd like to read some theoretical proofs (if any exist) to convince myself that a particular optimizer gives better results than the rest.

And, similarly, what kind of optimizer is better for binary classification when images are very different from each other?

bit_scientist

3 Answers


I have consistently found Adam to work very well, but to tell you the truth I have not seen all that much difference in performance based on the optimizer. Other factors seem to have much more influence on the final model performance.

In particular, adjusting the learning rate during training can be very effective. Saving the weights from the epoch with the lowest validation loss and loading those weights before making predictions also works very well. Keras provides two callbacks that help you achieve this; the documentation is at https://keras.io/callbacks/.

The ReduceLROnPlateau callback lets you adjust the learning rate based on a monitored metric, typically the validation loss. If the loss fails to improve for N consecutive epochs (parameter patience), the learning rate is multiplied by a factor (parameter factor). You can think of training as descending into a valley that gets narrower and narrower as you approach the bottom: if the learning rate does not adapt to this "narrowness", there is no way you will get to the very bottom.

The other callback is ModelCheckpoint. It lets you save the model (or just the weights) based on a monitored metric, again usually the validation loss, with the parameter save_best_only set to True. This saves the model with the lowest validation loss, and that model can then be used to make predictions.
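A minimal sketch of how the two callbacks could be wired together; the toy data, the tiny model, the file name and the parameter values below are illustrative assumptions, not part of the original answer:

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint

# Toy stand-ins for real image data (assumption: 64x64 grayscale images, binary labels).
x_train, y_train = np.random.rand(100, 64, 64), np.random.randint(0, 2, 100)
x_val, y_val = np.random.rand(20, 64, 64), np.random.randint(0, 2, 20)

# A deliberately tiny binary classifier; any compiled Keras model works the same way.
model = Sequential([Flatten(input_shape=(64, 64)),
                    Dense(32, activation="relu"),
                    Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Halve the learning rate if validation loss has not improved for 3 consecutive epochs.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3, verbose=1)

# Keep only the weights with the lowest validation loss seen so far.
checkpoint = ModelCheckpoint("best_weights.h5", monitor="val_loss",
                             save_best_only=True, save_weights_only=True)

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=20, callbacks=[reduce_lr, checkpoint])

# Reload the best weights before making predictions.
model.load_weights("best_weights.h5")
predictions = model.predict(x_val)
```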

Gerry P

The fact that the images are similar to each other, or the fact that you are doing binary classification, does not point you to a particular choice of optimizer: when an optimization algorithm is developed, that information is not taken into account. What is taken into account is the nature of the function we want to optimize (is it smooth, convex, strongly convex, are the stochastic gradients noisy, ...). The most widely used optimizer by far is Adam; under some assumptions on the boundedness of the gradient of the objective function, this paper gives the convergence rate of Adam, and the authors also provide experiments to validate that Adam performs better than some other optimizers. Some other works propose to combine Adam with Nesterov momentum acceleration.
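For reference, switching optimizers in Keras is just a matter of what you pass at compile time. A minimal sketch, assuming a small placeholder model; Nadam is the built-in Keras variant that combines Adam with Nesterov momentum, and the learning rate shown is only the common default, not a recommendation:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam, Nadam

# Placeholder binary classifier; the optimizer choice is independent of the architecture.
model = Sequential([Dense(16, activation="relu", input_shape=(32,)),
                    Dense(1, activation="sigmoid")])

# Plain Adam with its usual default step size.
model.compile(optimizer=Adam(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])

# Nadam: Adam combined with Nesterov momentum, as mentioned above.
model.compile(optimizer=Nadam(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])
```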

hola

If you are using a shallow neural network, SGD would be better; the Adam optimizer tends to overfit sooner. But be careful about choosing the learning rate.
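A minimal sketch of what this would look like in Keras, assuming a small shallow model; the learning rate and momentum values are placeholders you would tune, not recommendations:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# A shallow network, the setting where plain SGD is suggested above.
model = Sequential([Dense(8, activation="relu", input_shape=(32,)),
                    Dense(1, activation="sigmoid")])

# The learning rate is the key knob here; 0.01 with momentum 0.9 is only a starting point.
model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
              loss="binary_crossentropy", metrics=["accuracy"])
```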

SahaTib