Momentum was a big step. It evens out several successive updates so that most of the motion in the weights points toward the optimum. It operates on sequential measurements of the error, which means several gradient estimates combined give a better local picture of the loss surface than any single one.
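A minimal sketch of that idea, assuming a generic `grad_fn` (a hypothetical callable returning one noisy gradient estimate): the velocity term blends each new estimate with the previous ones, so the noise cancels while the consistent direction accumulates.

```python
import numpy as np

def sgd_momentum(w, grad_fn, lr=0.01, beta=0.9, steps=100):
    """SGD with momentum: the velocity is a running average of recent
    gradient estimates, smoothing the motion toward the optimum."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)                 # one noisy (stochastic) gradient estimate
        v = beta * v + (1 - beta) * g  # blend it with the previous estimates
        w = w - lr * v                 # step along the smoothed direction
    return w

# Toy usage: a noisy quadratic bowl centred at the origin.
rng = np.random.default_rng(0)
noisy_grad = lambda w: 2 * w + rng.normal(scale=0.5, size=w.shape)
print(sgd_momentum(np.array([5.0, -3.0]), noisy_grad))  # ends near [0, 0]
```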
The error has a magnitude, so the gradient has both a direction and a magnitude: it tells us not only which way to go but also how far. Because the loss surface is complex, many momentum-style methods smooth both the magnitude and the direction by combining gradient estimates accumulated over thousands of steps, typically as exponential moving averages.
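One common way this smoothing is done (an Adam-style update, used here only as an illustration; the parameter names are assumptions, not anything from the text) keeps two exponential moving averages, one for the direction and one for the squared magnitude:

```python
import numpy as np

def adam_like(w, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Adam-style update: m smooths the gradient's direction, v smooths its
    squared magnitude; the step is the smoothed direction rescaled by the
    inverse of the smoothed magnitude."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g      # running estimate of direction
        v = beta2 * v + (1 - beta2) * g * g  # running estimate of magnitude
        m_hat = m / (1 - beta1 ** t)         # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```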
Distillation (knowledge distillation) is interesting because it can take a network 10x larger than necessary to learn a task, then distill that learning into a 1x network that performs the task. This is a universal-to-specific transformation: using a large, general approximator to find the local landscape that works is very different from building an approximator intended only for use within that landscape. Both networks converge to the same function, but one expresses it in vastly fewer parameters. The gradient around the optimum, the perturbed gradient given the training data, tells us how to perform the simplification.
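A minimal sketch of that big-to-small transfer, using the standard softened-logits (Hinton-style) distillation loss; the layer sizes, temperature, and data here are illustrative assumptions, not the author's specific setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes: the teacher is roughly 10x wider than the student.
teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(32, 24), nn.ReLU(), nn.Linear(24, 10))

opt = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
T = 4.0  # temperature: softens the teacher's output distribution

x = torch.randn(64, 32)          # stand-in for a batch of training data
with torch.no_grad():
    teacher_logits = teacher(x)  # the teacher stays frozen

student_logits = student(x)
# KL divergence between softened distributions: the student is pushed to
# reproduce the teacher's behaviour in the neighbourhood of the training data.
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=1),
    F.softmax(teacher_logits / T, dim=1),
    reduction="batchmean",
) * (T * T)
opt.zero_grad()
loss.backward()
opt.step()
```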
Saliency maps use back-propagation, a single backward pass through a fully trained network, to infer which inputs the network is sensitive to, giving a window into the interior structure and operations of complex neural networks.
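A minimal sketch of a vanilla gradient saliency map, assuming a placeholder classifier and a random input image (any trained model and real image would stand in for these): one backward pass gives the gradient of the predicted-class score with respect to every input pixel.

```python
import torch
import torch.nn as nn

# Placeholder "trained" classifier; a real trained model would stand in here.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

x = torch.randn(1, 1, 28, 28, requires_grad=True)  # one input image
score = model(x)[0].max()   # score of the predicted class
score.backward()            # single backward pass through the trained net

saliency = x.grad.abs().squeeze()  # |d score / d pixel|: which inputs matter
print(saliency.shape)              # (28, 28) map over the input
```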