
I am trying to generate a model that uses several physicochemical properties of a molecule (including number of atoms, number of rings, volume, etc.) to predict a numeric value $Y$. I would like to use PLS Regression, and I understand that standardization is very important here. I am programming in Python, using scikit-learn.

The types and ranges of the features vary. Some are int64, while others are floating-point numbers. Some features generally have small (positive or negative) values, while others have very large values. I have tried various scalers (e.g. the standard scaler, normalize, the min-max scaler, etc.), yet the R2/Q2 values are still low.

I have a few questions:

  1. Is it possible that by scaling, some of the very important features lose their significance, and thus contribute less to explaining the variance of the response variable?

  2. If yes, and I identify some important features (by expert knowledge), is it OK to scale all features except those? Or to scale only the important features?

  3. Some of the features, although not always correlated, have values that are in a similar range (e.g. 100-400), compared to others (e.g. -1 to 10). Is it possible to scale only a specific group of features that are within the same range?

nbro
Yannick

1 Answer


In general, algorithms that exploit distances or similarities (e.g. in the form of a scalar product) between data samples, such as k-NN and SVM, are sensitive to feature transformations. We scale features so that every feature initially has a roughly similar impact on the model and, depending on the scaler chosen, so that the model is less affected by outliers.

Graphical-model-based classifiers, such as Fisher LDA or Naive Bayes, as well as decision trees and tree-based ensemble methods (RF, XGB), are invariant to feature scaling. Still, it might be a good idea to rescale/standardize your data.
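As a rough illustration of the two paragraphs above, the following sketch compares a distance-based model (k-NN) with a decision tree before and after standardization. The data are synthetic and purely hypothetical (one feature in the range 100-400, one in the range -1 to 10, with the response depending only on the second), so this shows the general behaviour rather than anything about the original dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic, illustrative data: a large-range feature and a small-range feature.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(100, 400, 200), rng.uniform(-1, 10, 200)])
# Assumption for the demo: only the small-range feature drives the response.
y = 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_scaled = StandardScaler().fit_transform(X)


def fit_predict(model, features):
    """Fit on the given features and predict on the same samples."""
    return model.fit(features, y).predict(features)


knn_raw = fit_predict(KNeighborsRegressor(n_neighbors=5), X)
knn_scaled = fit_predict(KNeighborsRegressor(n_neighbors=5), X_scaled)
tree_raw = fit_predict(DecisionTreeRegressor(max_depth=3, random_state=0), X)
tree_scaled = fit_predict(DecisionTreeRegressor(max_depth=3, random_state=0), X_scaled)

# k-NN picks different neighbours once the large-range feature stops dominating
# the Euclidean distance; the tree splits on feature order, which scaling preserves.
knn_changed = not np.allclose(knn_raw, knn_scaled)
tree_changed = not np.allclose(tree_raw, tree_scaled)
print("k-NN predictions changed by scaling:", knn_changed)
print("Tree predictions changed by scaling:", tree_changed)
```

PLS is closer to the first group: it builds components from covariances between the features and the response, so the relative scale of the features matters.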

  1. You should explore your data more carefully, find the outliers, and apply a transformation if needed.

  2. I am not sure that this is a good idea.

  3. You can apply different preprocessing techniques to different features, such as MinMaxScaler, a rank transform, a log transform, a square-root transform, StandardScaler, etc.

nbro