Questions tagged [learning-rate]

For questions related to the concept of learning rate (of an optimization algorithm, such as gradient descent) in machine learning.

32 questions
9
votes
2 answers

Is there an ideal range of learning rate which always gives a good result in almost all problems?

I once read somewhere that there is a range of learning rate within which learning is optimal in almost all cases, but I can't find any literature about it. All I could find is the following graph from the paper: The need for small learning rates…
9
votes
1 answer

What causes a model to require a low learning rate?

I've pondered this for a while without developing an intuition for the math behind it. So what causes a model to need a low learning rate?
8
votes
1 answer

Why is the learning rate generally below 1?

In all examples I've ever seen, the learning rate of an optimisation method is always less than $1$. However, I've never found an explanation as to why this is. In addition to that, there are some cases where having a learning rate bigger than 1 is…
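A standard way to see where the threshold comes from (a worked sketch, not part of the question itself): apply gradient descent with step size $\eta$ to the one-dimensional quadratic $f(w) = \frac{1}{2}\lambda w^2$.

```latex
\[
w_{t+1} = w_t - \eta f'(w_t) = (1 - \eta\lambda)\, w_t
\quad\Longrightarrow\quad
w_t = (1 - \eta\lambda)^t\, w_0 ,
\]
\[
\text{so the iterates converge iff } |1 - \eta\lambda| < 1,
\text{ i.e. } 0 < \eta < \frac{2}{\lambda}.
\]
```

Whenever the curvature $\lambda$ exceeds $2$, stability alone already forces $\eta < 1$, which is one reason practical learning rates sit well below $1$.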
6
votes
1 answer

Should I be decaying the learning rate and the exploration rate in the same manner?

Should I be decaying the learning rate and the exploration rate in the same manner? What counts as too slow or too fast a decay for exploration and for the learning rate? Or is it specific to each model?
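A minimal sketch of what decoupled schedules might look like; the decay constants and horizons below are illustrative assumptions, not recommendations:

```python
import math

def exp_decay(start, end, decay_steps, step):
    """Exponentially anneal from `start` to `end` over `decay_steps` steps."""
    frac = min(step / decay_steps, 1.0)
    return end + (start - end) * math.exp(-5.0 * frac)

for episode in range(10_000):
    # Exploration often decays faster than the learning rate: once the
    # agent has seen enough of the state space it should mostly exploit,
    # but it may still need small updates to keep refining its values.
    epsilon = exp_decay(start=1.0, end=0.05, decay_steps=3_000, step=episode)
    alpha   = exp_decay(start=0.5, end=0.01, decay_steps=8_000, step=episode)
```

Keeping the two schedules as separate functions makes it cheap to experiment with decaying them at different speeds.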
4
votes
2 answers

Is there a way to translate the concept of batch size into reinforcement learning?

I am using a neural network as my function approximator for reinforcement learning. In order to get it to train well, I need to choose a good learning rate. Hand-picking one is difficult, so I read up on methods of programmatically choosing a…
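One widely cited programmatic method is the learning-rate range test from Smith's "Cyclical Learning Rates for Training Neural Networks" (2017): sweep the learning rate geometrically over a few hundred mini-batches and pick a value just below where the loss starts to blow up. A hedged PyTorch-style sketch, where `model`, `loader`, and `loss_fn` are placeholders:

```python
import torch

def lr_range_test(model, loader, loss_fn, lr_min=1e-6, lr_max=1.0, steps=200):
    """Sweep the LR geometrically and record the loss at each step."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
    gamma = (lr_max / lr_min) ** (1.0 / steps)
    history = []
    for step, (x, y) in zip(range(steps), loader):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        history.append((optimizer.param_groups[0]["lr"], loss.item()))
        for group in optimizer.param_groups:
            group["lr"] *= gamma            # geometric ramp-up
    return history  # choose an LR just below where the loss diverges
```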
4
votes
2 answers

Is stable learning preferable to jumps in accuracy/loss?

A stable/smooth validation curve often seems to keep improving over more epochs than an unstable one. My intuition is that dropping the learning rate and increasing the patience of a model that produces a stable learning curve…
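The "drop the learning rate with patience" idea in the excerpt maps directly onto a plateau scheduler. A minimal PyTorch sketch, where `model`, `num_epochs`, `train_one_epoch`, and `evaluate` are placeholders and the factor and patience values are illustrative:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the LR whenever validation loss fails to improve for 5 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5
)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)      # placeholder training loop
    val_loss = evaluate(model)             # placeholder validation pass
    scheduler.step(val_loss)               # LR drops only on plateaus
```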
3
votes
0 answers

Would a different learning rate for every neuron and layer mitigate or solve the vanishing gradient problem?

I'm interested in using the sigmoid (or tanh) activation function instead of ReLU. I'm aware of ReLU's advantages: faster computation and no vanishing gradient problem. But regarding the vanishing gradient, the main problem is the backpropagation…
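Whatever its effect on the vanishing gradient problem, a per-layer learning rate is at least easy to express. A minimal PyTorch sketch that gives the early sigmoid layers, whose gradients arrive attenuated, a larger step; the architecture and scaling factors are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 64), nn.Sigmoid(),   # early layer: smallest gradients
    nn.Linear(64, 64), nn.Sigmoid(),
    nn.Linear(64, 10),                 # output layer: largest gradients
)

# Compensate for attenuated gradients by scaling the LR with depth.
optimizer = torch.optim.SGD([
    {"params": model[0].parameters(), "lr": 0.4},
    {"params": model[2].parameters(), "lr": 0.2},
    {"params": model[4].parameters(), "lr": 0.1},
], lr=0.1)
```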
3
votes
1 answer

In Q-learning, shouldn't the learning rate change dynamically during the learning phase?

I have the following code (below), where an agent uses Q-learning (RL) to play a simple game. What appears questionable to me in that code is the fixed learning rate. When it's set low, it always favours the old Q-value over the…
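A common alternative to a fixed rate (a standard textbook technique, not the question's own code) is to decay $\alpha$ per state-action pair with its visit count, so early updates move the Q-value a lot and later ones only refine it:

```python
from collections import defaultdict

Q = defaultdict(float)      # Q[(state, action)] -> value estimate
visits = defaultdict(int)   # visit counts per (state, action)
gamma = 0.99

def update(state, action, reward, next_state, actions):
    visits[(state, action)] += 1
    alpha = 1.0 / visits[(state, action)]   # per-pair decaying learning rate
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```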
2
votes
1 answer

Why can the learning rate make the loss increase in stochastic gradient descent?

In Deep Learning by Goodfellow et al., I came across the following line in the chapter on Stochastic Gradient Descent (p. 287): The main question is how to set $\epsilon_0$. If it is too large, the learning curve will show violent oscillations,…
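The oscillation is easy to reproduce on a toy quadratic, where a step size above $2/\lambda$ makes every update overshoot the minimum by more than it corrects. A self-contained illustration, not from the book:

```python
def gradient_descent(lr, curvature=1.0, w0=1.0, steps=10):
    """Minimise f(w) = 0.5 * curvature * w**2 starting from w0."""
    w = w0
    for _ in range(steps):
        w -= lr * curvature * w      # gradient step on f'(w) = curvature * w
    return w

print(gradient_descent(lr=0.5))   # converges:  |1 - 0.5| < 1
print(gradient_descent(lr=2.5))   # diverges:   |1 - 2.5| = 1.5 > 1
```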
2
votes
0 answers

Do learning rate schedulers conflict with or prevent convergence of the Adam optimiser?

An article on https://spell.ml says: "Because Adam manages learning rates internally, it's incompatible with most learning rate schedulers. Anything more complicated than simple learning warmup and/or decay will put the Adam optimizer to 'complete'…"
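Whatever one makes of the article's claim, the two mechanisms do compose mechanically: a scheduler rescales Adam's global step size while Adam's per-parameter moment estimates are untouched. A minimal PyTorch sketch with warmup-then-decay, where `model` is a placeholder and the schedule shape is an illustrative assumption:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# Linear warmup for 500 steps, then cosine decay. The scheduler only
# rescales the global lr; Adam's per-parameter adaptation is untouched.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=500
)
decay = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, decay], milestones=[500]
)
```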
2
votes
2 answers

How does $\alpha$ affect the convergence of the TD algorithm?

In Temporal-Difference Learning, we update our value function by $V\left(S_{t}\right) \leftarrow V\left(S_{t}\right)+\alpha\left(R_{t+1}+\gamma V\left(S_{t+1}\right)-V\left(S_{t}\right)\right)$ If we choose a constant $\alpha$, will the algorithm…
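The standard stochastic-approximation answer (a fact from the literature, not from the question itself) is that a constant $\alpha$ need not converge; convergence with probability 1 requires a decaying sequence satisfying the Robbins-Monro conditions:

```latex
\[
\sum_{n=1}^{\infty} \alpha_n = \infty
\qquad\text{and}\qquad
\sum_{n=1}^{\infty} \alpha_n^2 < \infty ,
\]
```

for example $\alpha_n = 1/n$. A constant $\alpha$ violates the second condition, so $V$ keeps fluctuating around the true value instead of settling.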
2
votes
0 answers

Has the idea of using different learning rates for different layers been explored in the literature?

I wonder whether there are heuristic rules for the optimal selection of learning rates for different layers. I expect that there is no general recipe, but probably there are some choices that may be beneficial. The common strategy uses the same…
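One heuristic that has appeared in the literature is discriminative fine-tuning (Howard and Ruder, 2018), which decreases the learning rate geometrically from the top layer down. A PyTorch-style sketch; the architecture, base LR, and decay factor are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 64), nn.ReLU(),
                      nn.Linear(64, 10))

layers = [m for m in model if isinstance(m, nn.Linear)]
base_lr, decay = 1e-3, 2.6   # lr shrinks by `decay` per layer of depth
groups = [
    {"params": layer.parameters(),
     "lr": base_lr / decay ** (len(layers) - 1 - i)}
    for i, layer in enumerate(layers)
]
optimizer = torch.optim.Adam(groups)
```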
2
votes
1 answer

How does the learning rate $\alpha$ vary in stationary and non-stationary environments?

In Sutton and Barto's book (2nd edition, Chapter 6: TD learning), the authors mention two ways of updating the value function: Monte Carlo method: $V(S_t) \leftarrow V(S_t) + \alpha[G_t - V(S_t)]$. TD(0) method: $V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} +…
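The trade-off behind that chapter can be seen in a few lines: a $1/n$ step size is ideal when the target is stationary, while a constant $\alpha$ keeps tracking a drifting target at the cost of never fully converging. A self-contained sketch; the drift and step sizes are illustrative:

```python
import random

def track(alpha_fn, target_drift=0.0, steps=5_000):
    """Incrementally estimate the mean of a (possibly drifting) signal."""
    true_mean, estimate = 0.0, 0.0
    for n in range(1, steps + 1):
        true_mean += target_drift                 # non-stationary if > 0
        sample = true_mean + random.gauss(0, 1)
        alpha = alpha_fn(n)
        estimate += alpha * (sample - estimate)   # same update form as TD
    return true_mean, estimate

print(track(lambda n: 1 / n))                   # stationary: 1/n converges
print(track(lambda n: 0.1, target_drift=0.01))  # drifting: constant alpha tracks
```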
2
votes
0 answers

Why does a lower learning rate reduce the train-test generalization gap?

In this blog post: http://www.argmin.net/2016/04/18/bottoming-out/ Prof. Recht shows two plots. He says one of the reasons the plot below has a lower train-test gap is that the model was trained with a lower learning rate (and he also manually…
2
votes
1 answer

Autoencoder network for feature selection not converging

I am training an undercomplete autoencoder network for feature selection. I am using one hidden layer in each of the encoder and decoder networks. The ELU activation function is used for each layer. For optimization, I am using the Adam optimizer.…
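For reference, a minimal version of the described setup: one hidden layer in each of encoder and decoder, ELU activations, and Adam. The layer sizes and the reduced learning rate are assumptions, since the excerpt truncates before giving them:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_features=100, n_bottleneck=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ELU(),
            nn.Linear(64, n_bottleneck),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_bottleneck, 64), nn.ELU(),
            nn.Linear(64, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
# A first thing to try when such a network fails to converge is an LR
# below Adam's default of 1e-3.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
```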