Momentum was a big step. It evens out several successive updates so that most of the motion in the weights points toward the optimum. It operates on sequential measurements of the error, which means several gradient estimates combined give a better local picture of the loss surface than any single one.
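A minimal sketch of that idea, assuming a generic `grad_fn` (a hypothetical callable returning one noisy gradient estimate): the velocity term blends each new estimate with the previous ones, so the noise cancels while the consistent direction accumulates.

```python
import numpy as np

def sgd_momentum(w, grad_fn, lr=0.01, beta=0.9, steps=100):
    """SGD with momentum: the velocity is a running average of recent
    gradient estimates, smoothing the motion toward the optimum."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)                 # one noisy (stochastic) gradient estimate
        v = beta * v + (1 - beta) * g  # blend it with the previous estimates
        w = w - lr * v                 # step along the smoothed direction
    return w

# Toy usage: a noisy quadratic bowl centred at the origin.
rng = np.random.default_rng(0)
noisy_grad = lambda w: 2 * w + rng.normal(scale=0.5, size=w.shape)
print(sgd_momentum(np.array([5.0, -3.0]), noisy_grad))  # ends near [0, 0]
```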
The error has a magnitude, so the gradient has both a direction and a magnitude: it tells us not only which way to go but also how far. Because the loss surface is complex, many momentum-style methods smooth both the magnitude and the direction by combining gradient estimates accumulated over thousands of steps, typically as exponential moving averages.
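One common way this smoothing is done (an Adam-style update, used here only as an illustration; the parameter names are assumptions, not anything from the text) keeps two exponential moving averages, one for the direction and one for the squared magnitude:

```python
import numpy as np

def adam_like(w, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Adam-style update: m smooths the gradient's direction, v smooths its
    squared magnitude; the step is the smoothed direction rescaled by the
    inverse of the smoothed magnitude."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g      # running estimate of direction
        v = beta2 * v + (1 - beta2) * g * g  # running estimate of magnitude
        m_hat = m / (1 - beta1 ** t)         # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```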
Distillation (knowledge distillation) is interesting because it can take a network 10x larger than necessary to learn a task, then distill that learning into a 1x network that performs the task. This is a universal-to-specific transformation: using a large, general approximator to find the local landscape that works is very different from building an approximator intended only for use within that landscape. Both networks converge to the same function, but one expresses it in vastly fewer parameters. The gradient around the optimum, the perturbed gradient given the training data, tells us how to perform the simplification.
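A minimal sketch of that big-to-small transfer, using the standard softened-logits (Hinton-style) distillation loss; the layer sizes, temperature, and data here are illustrative assumptions, not the author's specific setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes: the teacher is roughly 10x wider than the student.
teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(32, 24), nn.ReLU(), nn.Linear(24, 10))

opt = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
T = 4.0  # temperature: softens the teacher's output distribution

x = torch.randn(64, 32)          # stand-in for a batch of training data
with torch.no_grad():
    teacher_logits = teacher(x)  # the teacher stays frozen

student_logits = student(x)
# KL divergence between softened distributions: the student is pushed to
# reproduce the teacher's behaviour in the neighbourhood of the training data.
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=1),
    F.softmax(teacher_logits / T, dim=1),
    reduction="batchmean",
) * (T * T)
opt.zero_grad()
loss.backward()
opt.step()
```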
Saliency maps use back-propagation, a single backward pass through a fully trained network, to infer which inputs the network is sensitive to, giving a window into the interior structure and operations of complex neural networks.
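A minimal sketch of a vanilla gradient saliency map, assuming a placeholder classifier and a random input image (any trained model and real image would stand in for these): one backward pass gives the gradient of the predicted-class score with respect to every input pixel.

```python
import torch
import torch.nn as nn

# Placeholder "trained" classifier; a real trained model would stand in here.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

x = torch.randn(1, 1, 28, 28, requires_grad=True)  # one input image
score = model(x)[0].max()   # score of the predicted class
score.backward()            # single backward pass through the trained net

saliency = x.grad.abs().squeeze()  # |d score / d pixel|: which inputs matter
print(saliency.shape)              # (28, 28) map over the input
```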