
BACKGROUND: There is a lot of information online about the problem of multicollinearity as it relates to machine learning and how to identify correlated features. However, I am still unclear on which variables to eliminate once a correlated subset of feature variables has been identified.

QUESTION: Once a correlated set of 2 or more feature variables has been identified by some method, how does one decide which one to retain?

nbro
Snehal Patel
  • Why eliminate any of the variables? [While I understand some arguments for doing so, it creates issues that should be considered.](https://stats.stackexchange.com/questions/555145/ridge-regression-for-multicollinearity-and-outliers/555163#555163) – Dave Dec 24 '22 at 13:47
  • @Dave, as a novice, I found that post to be confusing. It seemed like there was heated controversy over the issue. Nonetheless, I understand the sentiment that if both (assuming we have 2) correlated features are improving the model, then "why not keep both?". – Snehal Patel Dec 25 '22 at 02:21

2 Answers


In practice, multicollinearity can be very common if your features really do act as correlated causes of your target. If the multicollinearity is only moderate, or if you are only interested in using your trained ML model to predict out-of-sample data with reasonable goodness-of-fit statistics and are not concerned with understanding the causal relationships between the predictor variables and the target variable, then multicollinearity doesn't necessarily need to be resolved; even a simple multivariable linear regression model could work well.
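The answer doesn't name a specific diagnostic, but one common way to gauge whether multicollinearity is moderate or severe is the variance inflation factor (VIF); this is an addition for illustration, not something prescribed above. A minimal sketch, assuming the predictors sit in a pandas DataFrame (the function name is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(X: pd.DataFrame) -> pd.Series:
    """Variance inflation factor for each column: 1 / (1 - R^2), where R^2
    comes from regressing that column on all the other columns. Values well
    above roughly 5-10 are the usual rough sign of problematic multicollinearity."""
    out = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        out[col] = 1.0 / (1.0 - r2) if r2 < 1.0 else np.inf
    return pd.Series(out)
```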

If you really do need to address multicollinearity, the quickest fix, and often an acceptable one, is to remove one or more of the highly correlated variables. Specifically, you may want to keep the variable that, per domain knowledge, has the strongest relationship with the target and the least overlap with the other retained variables, as this is intuitively the most informative one for prediction.
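As a rough sketch of that heuristic (not anything prescribed by the answer), the snippet below scores each candidate by its absolute correlation with the target minus its average absolute correlation with the other candidates, assuming a pandas DataFrame `X` of features and a Series `y`; the function name and scoring rule are invented for the example, and domain knowledge should still take precedence over the numbers.

```python
import pandas as pd

def pick_variable_to_keep(X: pd.DataFrame, y: pd.Series, candidates: list[str]) -> str:
    """Among a group of correlated features, score each one by its absolute
    correlation with the target minus its mean absolute correlation with the
    other candidates, and return the highest-scoring feature."""
    scores = {}
    for col in candidates:
        target_corr = abs(X[col].corr(y))                # relationship with the target
        others = [c for c in candidates if c != col]
        # redundancy with the rest of the correlated group
        overlap = X[others].corrwith(X[col]).abs().mean() if others else 0.0
        scores[col] = target_corr - overlap
    return max(scores, key=scores.get)
```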

Secondly, you can try linearly combining the predictor variables in some way, for example by adding or subtracting them. Doing so creates new variables that encompass the information from several correlated variables, so you no longer have a multicollinearity issue. If it is still difficult to decide which variables to retain, you can employ dimensionality reduction techniques such as principal component analysis (PCA) or partial least squares (PLS), or regularization techniques such as Lasso or Ridge regression, which can be used to identify the most important variables in a correlated set.
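A hedged scikit-learn sketch of those two routes, on toy data with two deliberately correlated predictors (all variable names are invented for the example): PCA combines the correlated set into components, while Lasso picks among the correlated predictors directly.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Toy data: x1 and x2 are nearly duplicates, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
y = 2 * x1 + x3 + 0.5 * rng.normal(size=200)

X_std = StandardScaler().fit_transform(X)   # scaling matters for both PCA and Lasso

# Option A: replace the correlated predictors with principal components
# that carry most of their shared variance.
pca = PCA(n_components=0.95)
X_components = pca.fit_transform(X_std)

# Option B: let an L1 penalty choose among the correlated predictors directly.
lasso = LassoCV(cv=5).fit(X_std, y)
kept = [col for col, coef in zip(X.columns, lasso.coef_) if abs(coef) > 1e-6]
print("Predictors with non-zero Lasso coefficients:", kept)
```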

mohottnad

I appreciate you asking the question. Speaking of statistics, the problem of multicollinearity is handled using partial correlation. Also, the correlation matrix is analyzed to understand the impact of the independent features on the target variable (output). It's quite good practice to eliminate features that have very little or no correlation with the target.
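For reference, partial correlation can be computed by the residual method: regress both variables on the controlling variables and correlate the residuals. A minimal sketch, assuming NumPy arrays (the function name is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def partial_correlation(x: np.ndarray, y: np.ndarray, controls: np.ndarray) -> float:
    """Correlation between x and y after removing the linear effect of the
    control variables from both. `controls` is a 2-D array with one column
    per controlling variable."""
    x_res = x - LinearRegression().fit(controls, x).predict(controls)
    y_res = y - LinearRegression().fit(controls, y).predict(controls)
    return float(np.corrcoef(x_res, y_res)[0, 1])
```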

But if you are worried about multicollinearity, look at each feature's correlation with the target variable: within a correlated group, drop the feature with the weaker correlation to the target. Say A, B, C, D, and E are five variables, where E is the target and the others are features determining E. If A and B have a correlation of 0.7, A and E have 0.8, and B and E have 0.7, then it makes sense to drop B. Reason: A and B are correlated, and A is also correlated with the target variable, so the apparent impact of B on E may simply be due to its correlation with A. B is therefore a feature that can be dropped with little impact on the model.
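That reasoning can be reproduced mechanically from the correlation matrix. A small sketch on synthetic data shaped like the example (the coefficients and the 0.6 threshold are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic data shaped like the example: A and B are strongly correlated,
# E is the target and depends mostly on A.
rng = np.random.default_rng(1)
A = rng.normal(size=500)
B = 0.7 * A + 0.7 * rng.normal(size=500)
C = rng.normal(size=500)
D = rng.normal(size=500)
E = 0.8 * A + 0.6 * rng.normal(size=500)
df = pd.DataFrame({"A": A, "B": B, "C": C, "D": D, "E": E})

corr = df.corr()
target_corr = corr["E"].drop("E").abs()                    # feature-target correlations
feature_corr = corr.drop(index="E", columns="E").abs()     # feature-feature correlations

# For any pair of features correlated above the threshold, mark the member
# with the weaker correlation to the target for removal.
to_drop = set()
for a in feature_corr.columns:
    for b in feature_corr.columns:
        if a < b and feature_corr.loc[a, b] >= 0.6:
            to_drop.add(a if target_corr[a] < target_corr[b] else b)

print(corr.round(2))
print("Suggested drop:", to_drop)   # B, following the reasoning above
```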

But again, compare the results both when keeping B in the feature set and when excluding it, to see the difference. Multicollinearity also causes issues in classification tasks, so do check out this blog post.
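One simple way to run that comparison is cross-validation with and without the suspect feature; a sketch reusing the synthetic `df` from the example above:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Fit the same model with and without B and compare cross-validated R^2
# before committing to the drop.
for cols in (["A", "B", "C", "D"], ["A", "C", "D"]):
    scores = cross_val_score(LinearRegression(), df[cols], df["E"], cv=5, scoring="r2")
    print(cols, "mean R^2:", round(scores.mean(), 3))
```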

oseekero
  • Thank you. That is an interesting strategy for deciding which one to keep. Are you also saying that we should build a model with A, B, C, D and with A, C, D, and if the former is more accurate, then we keep B in the model, regardless of multicollinearity? – Snehal Patel Dec 23 '22 at 22:47
  • Yes, if your model works better, then it doesn't hurt to have some multicollinearity among the features. It's not that the features must have no multicollinearity at all, but rather that it should be kept to a minimum. That requirement is strictly followed in algorithms like linear regression. But if your model fits better, then include it. – oseekero Dec 24 '22 at 06:03
  • This "trick" is based on what logic? Why should we drop the feature that has the highest correlation with the target? Why not the opposite? Why would this be a good idea? You should edit your post to clarify this. – nbro Dec 24 '22 at 10:37
  • I apologize if I wasn't clear before; as per your suggestion, I have edited the post. I didn't mention dropping the feature that has the highest correlation with the target in the post. I suggest dropping the feature that is correlated more with the other features than with the target variable. Please look at the example I gave in my post. Do comment back if there is any doubt about it. – oseekero Dec 24 '22 at 13:17
  • I have a slight issue with the comment: "It's quite good practice to eliminate features that have very little or no correlation with the target." It is possible that a feature may have a significant impact on overall model performance (e.g., through an interaction with a second feature) even if it is poorly correlated with the target. In my opinion, a low correlation between the feature and the target is just a soft sign that the feature may not be relevant. Eliminating the feature may be good for an initial "lean" model, but it should be explored more for the final model. – Snehal Patel Dec 25 '22 at 02:04
  • I also have an issue with the comment: "The correlation matrix is analyzed to understand the impact of the independent features on the target variable (output)." The correlation matrix assesses correlation between pairs of features (and possibly the target, if it is included). The features may or may not be independent; rather, assessing the independence of the features (and possibly the target) is a major purpose of the correlation matrix. – Snehal Patel Dec 25 '22 at 02:12
  • Yes, by "independent" I meant the features that we use to determine the target variable. Independent features, in that sense, are not dependent on any other variable for their input. Say you create an ML application with 10 features and one output, and all 10 features are independent in the sense that each can be given any value as input; in terms of degrees of freedom, there will be 10 for this model. So when I used the word "independent" I didn't mean statistical independence. And about the correlation matrix (CM), I was just stating one use case of the CM. – oseekero Dec 29 '22 at 17:51
  • So you just meant "different" features, correct? If so, then it may be worth clarifying your Answer to remove the confusion. – Snehal Patel Jan 02 '23 at 15:05