In L1 regularization, the penalty term computed for every parameter is the absolute value of that weight (times some regularization factor).
Thus, irrespective of whether a weight is positive or negative (due to the absolute value) and irrespective of how large the weight is, a penalty is incurred as long as the weight is nonzero. So the only way a training procedure can considerably reduce the L1 regularization penalty is by driving all (unnecessary) weights towards 0, which results in a sparse representation.
Of course, the L2 regularization term is also strictly 0 only when all weights are 0. However, in L2, a weight's contribution to the penalty is proportional to its squared value. Therefore, a weight whose absolute value is smaller than 1, i.e. $|w| < 1$, is punished much less by L2 than it would be by L1, which means that L2 puts less emphasis on driving all weights towards exactly 0. This is because squaring a value in $(0, 1)$ yields a value of lower magnitude than the un-squared value itself: $x^2 < |x|$ for all $x$ with $0 < |x| < 1$.
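A quick numerical check makes this concrete. The sketch below compares the per-weight L1 and L2 penalty contributions for a few small weights; the regularization factor `lam` and the specific weight values are illustrative assumptions, not taken from any particular model.

```python
def l1_penalty(w, lam=1.0):
    # L1 contribution of a single weight: lam * |w|
    return lam * abs(w)

def l2_penalty(w, lam=1.0):
    # L2 contribution of a single weight: lam * w^2
    return lam * w ** 2

# For |w| < 1, the squared (L2) penalty is strictly smaller than the L1 one.
for w in [0.9, 0.5, 0.1, 0.01]:
    print(f"w={w}: L1={l1_penalty(w):.4f}, L2={l2_penalty(w):.4f}")
```

For `w = 0.1`, the L1 contribution is 0.1 but the L2 contribution is only 0.01, so L2 barely "notices" weights that are already small.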
So, while both regularization terms reach 0 only when all weights are 0, the L1 term penalizes small weights with $|x| < 1$ much more strongly than L2 does, thereby driving each weight more strongly towards 0 than L2 does.