
From Berkeley CS182, SP22: https://cs182sp22.github.io/assets/lecture_slides/2022.01.26-ml-review-pt2.pdf.

Can someone help me interpret this diagram? I understand the graph on the left, but I don't understand how, in the right graph, the test risk starts going back down. I'm unfamiliar with the "interpolating regime", so maybe that would explain some things.

  • A good interactive explanation can be seen at MLU-explain: https://mlu-explain.github.io/bias-variance/ – shamisen May 18 '23 at 21:57

1 Answer


The right plot shows Deep Double Descent, a phenomenon observed in deep learning that challenges the classical picture from statistical learning theory shown in the left plot.

  • The first half of the right plot depicts the classical empirical risk minimization setting, in which you seek the optimal model capacity: the one that balances bias and variance, achieving both a low training error and good generalization.
  • It has been observed that if you keep increasing the capacity of the model (assuming proper regularization), the training error goes to zero at the interpolation threshold and, unexpectedly, the test (generalization) error starts decreasing again as the capacity grows further.
  • This modern interpolating regime, where over-parameterized models actually generalize much better than models with just the right capacity, contradicts the classical view and the bias-variance trade-off (see the sketch after this list for a minimal numerical illustration).
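
If it helps to see the effect outside of deep networks, here is a minimal, self-contained sketch (not from the slides; all names and hyperparameter choices are my own illustrative assumptions) that typically reproduces a double-descent curve with random Fourier features and a minimum-norm least-squares fit. Around p ≈ n the test error usually spikes, and for p much larger than n it tends to come back down:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.2):
    # Toy 1D regression problem: noisy samples of sin(3x) on [-1, 1].
    x = rng.uniform(-1.0, 1.0, size=(n, 1))
    y = np.sin(3.0 * x).ravel() + noise * rng.normal(size=n)
    return x, y

def fourier_features(x, W, b):
    # Random Fourier features: phi_j(x) = cos(w_j * x + b_j); "capacity" = number of features p.
    return np.cos(x @ W + b)

x_train, y_train = make_data(40)      # n = 40 training points
x_test, y_test = make_data(2000)

for p in [5, 10, 20, 30, 40, 60, 100, 300, 1000]:   # p = 40 is the interpolation threshold here
    W = 3.0 * rng.normal(size=(1, p))
    b = rng.uniform(0.0, 2.0 * np.pi, size=p)
    Phi_tr = fourier_features(x_train, W, b)
    Phi_te = fourier_features(x_test, W, b)
    # pinv gives the least-squares fit for p < n and the minimum-norm interpolant for p >= n.
    w = np.linalg.pinv(Phi_tr) @ y_train
    train_mse = np.mean((Phi_tr @ w - y_train) ** 2)
    test_mse = np.mean((Phi_te @ w - y_test) ** 2)
    print(f"p={p:5d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

The exact numbers depend on the random seed, but the qualitative shape (test error falling, peaking near p ≈ n, then falling again) mirrors the right plot in the slides.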

One explanation for this is that, for deep learning models, variance comes not only from the sampling of the training data (as assumed in classical ML), but also from the weight initialization, the optimization, and the training procedure. It seems that over-parameterization decreases this variance, allowing for better generalization; the sketch below illustrates the idea with the same toy model.
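
To make that claim concrete, here is a rough sketch (same random-features setup as above, purely illustrative and not the slides' or the answer's code) that treats each random draw of the feature weights as a "seed" and measures how much the test-set predictions vary across seeds. In this toy setting, the seed-to-seed variance usually blows up near the interpolation threshold and shrinks again once p is far past it:

```python
import numpy as np

rng = np.random.default_rng(1)
n, noise = 40, 0.2
x_train = rng.uniform(-1.0, 1.0, size=(n, 1))
y_train = np.sin(3.0 * x_train).ravel() + noise * rng.normal(size=n)
x_test = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)

def fit_predict(p, seed):
    # One "seed" = one random draw of the feature weights (a stand-in for weight-init randomness).
    r = np.random.default_rng(seed)
    W = 3.0 * r.normal(size=(1, p))
    b = r.uniform(0.0, 2.0 * np.pi, size=p)
    Phi_tr = np.cos(x_train @ W + b)
    Phi_te = np.cos(x_test @ W + b)
    w = np.linalg.pinv(Phi_tr) @ y_train          # min-norm least squares, as above
    return Phi_te @ w

for p in [20, 40, 80, 400, 2000]:
    preds = np.stack([fit_predict(p, s) for s in range(10)])   # 10 independent draws
    # Variance across draws, averaged over the test inputs.
    print(f"p={p:5d}  mean prediction variance across seeds = {preds.var(axis=0).mean():.4f}")
```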

Luca Anzalone
  • Got it, that makes sense. How does over-parameterization reduce the variance from the weight initialization, optimization, and training procedure? – 9j09jf02jsd May 16 '23 at 18:56
  • One assumption can be that over-parameterized models have many weights, some of which are redundant, and there can also be symmetries among them. These can help both with optimization and training (since some weights may already point in the right direction of improvement) and with the variance from weight initialization (redundancy may limit the effect of any single initialization). – Luca Anzalone May 20 '23 at 16:33
  • Hmm, I'm not fully convinced. I thought the right "direction" of improvement is specified by the negative gradient, so I don't see how more parameters help with this. And regarding weight-init variance, I thought having more weights could create more local minima that SGD could get stuck in. I don't see how, in general, more parameters => less variance. – 9j09jf02jsd May 20 '23 at 22:11