
It is hard to find a widely recognized, clearly defined resource that explains the concept of the "gradient norm" comprehensively. Many search results contain insights from machine learning experts or references to papers that touch upon the gradient norm, but there doesn't seem to be a single, definitive source that covers the topic in depth.

Is there a recommended resource that can provide a detailed overview of the gradient norm?

StudentV
    Welcome to AI SE! Please provide the research that you have done in an attempt to answer your own questions (links to the resources you mention) within the post. This will help others understand what you have or have not seen! – Robin van Hoorn May 27 '23 at 19:15

1 Answer


The norm is a mathematical operation that can be applied to vectors or matrices; informally, it measures the "length" of such mathematical objects.

Since the gradient $g=\nabla_\theta f(x;\theta)$ of some differentiable function $f:\mathbb R^N\to \mathbb R^M$ w.r.t. its parameters $\theta\in\mathbb R^N$ can be either a scalar (if $N=M=1$), a vector (if $N>1, M=1$), or a matrix (the Jacobian, if $N>1, M>1$), the "gradient norm" is just the norm operation applied to $g$.

  • If the gradient is a scalar, the norm is just its absolute value: $|g|$.
  • For the vector case, the norm measures the length or magnitude of the vector. There are various notions of length, resulting in different norms. The one most used (especially in the context of gradients) is the Euclidean norm (also called the $l_2$-norm): $\|g\|_2$.
  • For matrices, the concept of vector norm is extended. For example, the Frobenius norm, $\|g\|_F$, is the matrix equivalent of the Euclidean norm (see the formulas below).
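
Concretely, for a vector $g\in\mathbb R^n$ and a matrix $g\in\mathbb R^{n\times m}$, these two norms are computed as $$ \|g\|_2 = \sqrt{\sum_{i=1}^n g_i^2}, \qquad \|g\|_F = \sqrt{\sum_{i=1}^n\sum_{j=1}^m g_{ij}^2}, $$ i.e. the square root of the sum of the squared entries in both cases.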

Now, in the context of DL, the gradient is usually a list of matrices and vectors (one entry per layer's weights and biases). So when referring to the "gradient norm" one usually means the global $l_2$-norm of all the gradients, computed as follows (see tf.linalg.global_norm): $$ \|G\|_2 = \sqrt{\sum_{g\in G}\|g\|_2^2}, $$ which is the square root of the sum of the squared Euclidean norms, one for each gradient $g$ in the list $G$ of gradients.
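
For instance, here is a minimal sketch of how this global norm can be computed by hand and checked against the built-in function (assuming TensorFlow; the toy tensor shapes are made up for illustration):

```python
import tensorflow as tf

# Toy "gradient list": in practice it would come from, e.g.,
# tape.gradient(loss, model.trainable_variables)
grads = [tf.random.normal((3, 4)), tf.random.normal((4,)), tf.random.normal((4, 2))]

# Global l2-norm: square root of the sum of the squared per-tensor l2-norms
manual_norm = tf.sqrt(tf.add_n([tf.reduce_sum(tf.square(g)) for g in grads]))

# Built-in equivalent
global_norm = tf.linalg.global_norm(grads)

print(manual_norm.numpy(), global_norm.numpy())  # the two values should match
```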

In DL, the norm of the gradients serves two main purposes: 1) monitoring the gradients during training, to detect whether they vanish or explode, and 2) gradient clipping within the optimization algorithm, i.e. rescaling the gradients whenever their norm exceeds a threshold: this can be done individually on each $g$, or globally on the whole list $G$ (see the sketch below).
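
As a rough sketch of the clipping case (again assuming TensorFlow; the threshold of 1.0 is an arbitrary example), the two variants could look like this:

```python
import tensorflow as tf

grads = [tf.random.normal((3, 4)), tf.random.normal((4,))]  # toy gradient list

# Per-tensor clipping: each gradient is rescaled so that its own l2-norm is at most 1.0
clipped_each = [tf.clip_by_norm(g, clip_norm=1.0) for g in grads]

# Global clipping: all gradients are rescaled by the same factor so that the
# global norm of the whole list is at most 1.0
clipped_global, global_norm = tf.clip_by_global_norm(grads, clip_norm=1.0)
```

The clipped list would then be passed to the optimizer (e.g. via optimizer.apply_gradients) in place of the raw gradients.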

Luca Anzalone