
Given a ridge and a lasso regularizer, which one should be chosen for better performance?

An intuitive graphical explanation (intersection of the elliptical contours of the loss function with the region of constraints) would be helpful.

nbro
jaeger6
  • L1 and L2 have slightly different effects. Have you read anything about them? – nbro Feb 07 '20 at 20:15
  • Yes, I do have an understanding in terms of the math involved. I am interested to know about the effect it has on the parameters that helps achieve the regularization effect. – jaeger6 Feb 07 '20 at 20:19

1 Answer


The following graph shows the constraint region (green), along with contours of the residual sum of squares (red ellipses). These are iso-lines: all points on a given ellipse have the same RSS.

Figure: Lasso (left) and Ridge (right) constraints [Source: Elements of Statistical Learning]

As ridge regression has a circular constraint ($\beta_1^2 + \beta_2^2 \le d$) with no edges, the intersection will not generally occur on an axis, signifying that the ridge regression parameter estimates will usually be non-zero.

On the contrary, the lasso constraint ($|\beta_1| + |\beta_2| \le d$) has corners at each of the axes, so the ellipse will often intersect the constraint region at an axis. In 2D, such a scenario results in one of the parameters becoming zero, whereas in higher dimensions, more of the parameter estimates may simultaneously reach zero.

This is a disadvantage of ridge regression: the least important predictors never get eliminated, so the final model contains all predictor variables. For the lasso, the L1 penalty forces some parameters to be exactly zero when $\lambda$ is large enough. This has a dimensionality-reduction effect, resulting in sparse models.
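To make this concrete, here is a small sketch (not from the answer above) using the closed-form solutions that hold for an orthonormal design: ridge shrinks every OLS coefficient proportionally, while the lasso applies soft-thresholding, which sets small coefficients exactly to zero. The coefficient values and $\lambda$ below are made up for illustration.

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    """Ridge (orthonormal design): proportional shrinkage, never exactly zero."""
    return beta_ols / (1.0 + lam)

def lasso_soft_threshold(beta_ols, lam):
    """Lasso (orthonormal design): soft-thresholding zeroes small coefficients."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

# Hypothetical OLS estimates: two large effects, two small ones.
beta_ols = np.array([3.0, 0.4, -0.2, 1.5])
lam = 0.5

print(ridge_shrink(beta_ols, lam))          # all coefficients remain non-zero
print(lasso_soft_threshold(beta_ols, lam))  # the two small ones become exactly 0
```

The corner intuition from the figure shows up here as the hard cutoff in `lasso_soft_threshold`: any coefficient with $|\beta_j| \le \lambda$ lands exactly on the axis, i.e. is set to zero.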

In cases where the number of predictors is small, L2 could be chosen over L1, as it constrains the coefficient norm while retaining all predictor variables.

s_bh
  • Your answer describes L1 and L2 regularization and their effect, but how does this relate to the concept of regularization? You should clarify what a person means by "regularization" and in which cases L1 is preferred over L2 (and vice versa), which is the actual question (I think). – nbro Feb 07 '20 at 20:59
  • 2
    Given that the author of the question asked for a graphical intuition behind regularization (with the constraint curve), I am given to understand he has knowledge of what "Regularization" signifies. The answer simply does not warrant a definition of the term. – s_bh Feb 07 '20 at 21:26