
I've pondered this for a while without developing an intuition for the math behind the cause of this.

So what causes a model to need a low learning rate?

JohnAllen
  • I wondered about it too, and I am curious why RNNs have a smaller learning rate than CNNs. From what I know, model complexity (depth) and/or huge datasets require finer tuning of the lr hyperparameter. – iustin Mar 31 '19 at 22:11

1 Answer


Gradient descent is a method for finding the optimal parameters of the hypothesis, i.e. for minimizing the cost function.

$$\theta := \theta - \alpha \, \nabla_\theta J(\theta)$$

where $\alpha$ is the learning rate.
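For concreteness, here is a minimal sketch of that update rule in Python, using a hypothetical one-dimensional cost $J(\theta) = (\theta - 3)^2$ (the toy cost and the function names are illustrative assumptions, not from the original post):

```python
# Toy cost J(theta) = (theta - 3)^2, whose gradient is dJ/dtheta = 2 * (theta - 3).
def grad(theta):
    return 2.0 * (theta - 3.0)

def gradient_descent(theta0, alpha, steps):
    """Repeatedly apply the update theta := theta - alpha * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta -= alpha * grad(theta)
    return theta

# With a moderate learning rate the iterates approach the minimum at theta = 3.
print(gradient_descent(theta0=0.0, alpha=0.1, steps=50))
```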

If the learning rate is too high, the updates can overshoot the minimum, so gradient descent can fail to minimize the cost function and end up with a higher loss.

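A quick numerical illustration of this overshooting (an assumed toy example on the same quadratic cost as above, not taken from the answer): on this cost, any learning rate above 1 makes the iterates diverge, so the loss grows instead of shrinking.

```python
def step(theta, alpha):
    # One gradient descent step on J(theta) = (theta - 3)^2.
    return theta - alpha * 2.0 * (theta - 3.0)

for alpha in (0.1, 0.9, 1.1):      # small, large but still stable, too large
    theta = 0.0
    for _ in range(20):
        theta = step(theta, alpha)
    print(f"alpha={alpha}: theta={theta:.3f}, loss={(theta - 3.0) ** 2:.3g}")
```

With alpha = 1.1 each step moves the parameter further away from the minimum, which is the "higher loss" behaviour described above.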

Since gradient descent can only find a local minimum, a learning rate that is too low may also result in bad performance (it can get stuck or converge very slowly). Starting from a randomly chosen value of this hyperparameter can increase the model's training time, but there are advanced methods, such as adaptive gradient descent, that can manage the training time.
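As a sketch of one such adaptive method (Adagrad is used here purely as an illustrative example; the specific method is my assumption and is not named in the answer), the effective learning rate at each step is the base rate divided by the root of the accumulated squared gradients:

```python
import math

def adagrad(grad, theta0, alpha=0.5, steps=200, eps=1e-8):
    """Adagrad: shrink the effective learning rate as squared gradients accumulate."""
    theta, accum = theta0, 0.0
    for _ in range(steps):
        g = grad(theta)
        accum += g * g
        theta -= alpha * g / (math.sqrt(accum) + eps)
    return theta

# Same toy cost J(theta) = (theta - 3)^2 as above; the step size adapts automatically.
print(adagrad(lambda t: 2.0 * (t - 3.0), theta0=0.0))
```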

There are many optimizers for the same task, but no optimizer is perfect. The appropriate learning rate depends on several factors:

  1. Size of the training data: as the size of the training data increases, the training time of the model increases. If you want a shorter training time, you can choose a higher learning rate, but this may result in worse performance.
  2. The optimizer (gradient descent) slows down whenever the gradient is small; in that case it is better to go with a higher learning rate (see the sketch after this list).
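To illustrate the second point with a hypothetical back-of-the-envelope example (the numbers are assumptions, not from the answer): on a plateau where the gradient is tiny, the step size alpha * |gradient| is tiny too, so a low learning rate needs far more steps to make the same progress.

```python
# On a plateau the gradient is tiny, so each step moves the parameter by alpha * |gradient|.
tiny_gradient = 1e-3

for alpha in (1e-3, 1e-1, 1.0):
    step_size = alpha * tiny_gradient
    # Rough count of steps needed to move the parameter by 1 unit at this step size.
    print(f"alpha={alpha}: step size {step_size:.0e}, about {1.0 / step_size:,.0f} steps per unit moved")
```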

PS: It is usually better to try several rounds of gradient descent.

Posi2
  • This is a good start, as it shows the difference between low and high learning rates in general. You also need to explain why the good learning rate varies depending on the task - and the OP was specifically asking why some problems require a lower learning rate than others. – Neil Slater Apr 01 '19 at 13:29
  • That's a good point. I have edited it. Since no specific problem is mentioned, I am going with a general one. – Posi2 Apr 01 '19 at 14:13
  • I still think that this does not answer the question. The OP is not asking about the optimiser or data, it is asking about the model. How does the model (its architecture, number of parameters, etc.) affect the learning rate? I think this is the actual question, which you do not answer. Everything else is quite irrelevant to the question and will only confuse readers that can't distinguish between these concepts. – nbro Apr 01 '19 at 14:36
  • Thanks for the feedback. Irrespective of the model architecture, when the number of parameters, the size of the data and the range of the data (the solution is to use normalized data) are high, the training time is higher, so we should change the learning rate accordingly. This applies to models such as linear regression, logistic regression, SVM, etc., since they use GD for optimization. Any response is always welcome :) – Posi2 Apr 01 '19 at 16:44
  • Any proof that assesses your claim "irrespective of the model architecture"? This answer still does not answer the OP question. You're answering to the question "how does the learning rate change in general, depending on the machine learning setting" (and your answer is not exhaustive, of course, because it does not mention "how the learning rate changes depending on the model", i.e. the actual question). – nbro Apr 01 '19 at 17:23
  • What is a machine learning setting? Gradient descent is an algorithm which takes some parameters; the algorithm does not change with a change in the model, it remains the same. Of course, there is a change in the number of parameters, size, range of values, etc. – Posi2 Apr 01 '19 at 18:28
  • All your results are based on the assumption that the loss curve is perfectly convex, which is very rarely the case, and that is why adaptive learning algorithms are used. –  Apr 01 '19 at 19:13
  • I mentioned that gradient descent finds a local minimum when considering a non-convex curve, and that there are also other, better optimizers present. – Posi2 Apr 02 '19 at 04:02