
A stable/smooth validation learning curve often seems to keep improving over more epochs than an unstable one. My intuition is that dropping the learning rate and increasing the patience for a model that produces a stable learning curve could lead to a better validation fit.
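For concreteness, this is roughly the setup I have in mind, as a minimal sketch assuming a Keras-style model (the data, architecture, and the exact patience/factor values are just placeholders):

    import numpy as np
    import tensorflow as tf

    # Dummy regression data so the snippet runs end to end.
    x_train, y_train = np.random.rand(512, 10), np.random.rand(512, 1)
    x_val, y_val = np.random.rand(128, 10), np.random.rand(128, 1)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss='mse')

    callbacks = [
        # Drop the learning rate when the validation loss plateaus.
        tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                                             patience=5, min_lr=1e-6),
        # Larger patience so a slowly but steadily improving curve is not cut off early.
        tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=20,
                                         restore_best_weights=True),
    ]

    history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                        epochs=100, callbacks=callbacks, verbose=0)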

The counter-argument is that a jump in the curve could mean the model has just learned something significant, but such jumps often fall back down or tail off shortly afterwards.

Is one better than the other? Is it possible to take aspects of both to improve learning?

Shayan Shafiq
Oliver P

2 Answers


There is an approach used in machine learning, called Simulated Annealing, which varies the learning rate: starting from a large rate, it is slowly reduced over time. The general idea is that the larger initial rate covers a broader range of the search space, while the increasingly lower rate then produces a less 'erratic' climb towards a maximum.

If you only ever use a low rate, you risk getting stuck in a local maximum, while a rate that stays too large will not settle on the best solution, only somewhere close to it. Adjusting the rate over time gives you the best of both.
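As a rough sketch of that idea (assuming TensorFlow/Keras here; the specific decay values are placeholders), a decaying learning-rate schedule looks something like this:

    import tensorflow as tf

    # Start with a comparatively large learning rate and decay it over time:
    # early steps explore broadly, later steps settle towards an optimum.
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.1,  # large initial rate for broad, 'erratic' exploration
        decay_steps=1000,           # decay applied every 1000 steps (placeholder)
        decay_rate=0.9,             # multiplicative decay factor (placeholder)
    )

    optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
    # Pass `optimizer` to model.compile(...) as usual.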

Oliver Mason
  • Thank you Oliver. It hadn't occurred to me that having too low a learning rate could result in getting stuck in a local maximum. I'm aware of learning rate decay in TensorFlow and have been using it, but I wasn't aware of how important it is to use both appropriately. Am I right in saying we would want big jumps early on and then stable learning later, as it settles on what is hopefully the best maximum? – Oliver P Sep 30 '20 at 13:33
  • @OliverP Yes, exactly. – Oliver Mason Sep 30 '20 at 15:35
  • That should have been local minimum, @OliverP – David Hoelzer Apr 23 '21 at 14:32
  • @DavidHoelzer No, local maximum is correct. – Oliver Mason Apr 25 '21 at 11:36

An erratic loss landscape can lead to an unstable learning curve, so it is generally better to choose a simpler function, which produces a simpler landscape. Jumps and irregularities in the training curve can also be caused by an uneven distribution of the data.

And yes, those jumps can mean the model has found something significant in the landscape; they often arise while the optimizer is moving between the landscape's many local minima.

In machine learning optimization we usually use algorithms such as Stochastic Gradient Descent (SGD) and Adam, which converge to local minima, whereas approaches such as Simulated Annealing aim for the global minimum. There has been plenty of discussion about whether that matters; many argue that, for machine learning problems, a local minimum is just as useful as the global one.

Thus, a stable learning curve is preferable, as it indicates that the model is converging to a local minimum.
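As a rough illustration of those optimizers (again assuming TensorFlow/Keras; the data, model, and hyperparameters are placeholders), you can train the same model with SGD and Adam and compare the resulting validation curves:

    import numpy as np
    import tensorflow as tf

    # Dummy data so the comparison runs end to end.
    x, y = np.random.rand(512, 10), np.random.rand(512, 1)

    def build_model():
        return tf.keras.Sequential([
            tf.keras.Input(shape=(10,)),
            tf.keras.layers.Dense(32, activation='relu'),
            tf.keras.layers.Dense(1),
        ])

    histories = {}
    for name, opt in [('sgd', tf.keras.optimizers.SGD(learning_rate=0.01)),
                      ('adam', tf.keras.optimizers.Adam(learning_rate=0.001))]:
        model = build_model()
        model.compile(optimizer=opt, loss='mse')
        # Small batches tend to give noisier ('erratic') validation curves;
        # larger batches and adaptive optimizers usually give smoother ones.
        histories[name] = model.fit(x, y, validation_split=0.2, batch_size=16,
                                    epochs=20, verbose=0).history

    for name, h in histories.items():
        print(name, 'final val_loss:', round(h['val_loss'][-1], 4))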

References


You can read A Survey of Optimization Methods from a Machine Learning Perspective by Shiliang Sun, Zehui Cao, Han Zhu, and Jing Zhao for an overview of the optimization methods commonly used in machine learning.

Saurav Maheshkar