
I am using a raw dataset with 4 feature variables (Total Cholesterol, Systolic Blood Pressure, Diastolic Blood Pressure, and Cigarettes per Day) to do a binary classification (predicting stroke likelihood) with the Logistic Regression algorithm.

I made sure that the class counts are balanced, i.e., there is an equal number of occurrences per class.
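
(A quick way to verify that balance, assuming the labels live in a pandas Series named y, which is an assumption on my part, is something like:

    import pandas as pd

    y = pd.Series([1, 2, 1, 2, 2, 1])   # placeholder labels: 1 = no stroke, 2 = stroke
    print(y.value_counts())             # equal counts per class => balanced set

)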

Using Python + sklearn, the problem is that the classification performance degrades sharply when I normalize the dataset using

    X = preprocessing.StandardScaler().fit(X).transform(X)

or

    X = preprocessing.MinMaxScaler().fit_transform(X)
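
For reference, this is roughly what the two scalers compute; a minimal sketch on made-up numbers (not the actual cohort), just to show how the feature magnitudes change:

    import numpy as np
    from sklearn import preprocessing

    # Toy feature matrix: two columns with very different magnitudes
    X_toy = np.array([[200.0,  0.0],
                      [240.0, 20.0],
                      [280.0, 40.0]])

    # StandardScaler: each column ends up with mean 0 and unit standard deviation
    print(preprocessing.StandardScaler().fit_transform(X_toy))

    # MinMaxScaler: each column is rescaled to the [0, 1] range
    print(preprocessing.MinMaxScaler().fit_transform(X_toy))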

So before normalizing the dataset:

         precision    recall  f1-score   support

      1       0.70      0.72      0.71        29
      2       0.73      0.71      0.72        31

avg / total   0.72      0.72      0.72        60

while after normalizing the dataset, the precision of class 1 decreased significantly:

         precision    recall  f1-score   support

      1       0.55      0.97      0.70        29
      2       0.89      0.26      0.40        31

 avg / total  0.72      0.60      0.55        60

Another observation I could not explain is the predicted probability of each class.

Before the normalization:

 [ 0.17029846  0.82970154]
 [ 0.47796534  0.52203466]
 [ 0.45997593  0.54002407]
 [ 0.54532438  0.45467562]
 [ 0.45999462  0.54000538]

After the normalization (for the same test set entries):

 [ 0.50033247  0.49966753]
 [ 0.50042371  0.49957629]
 [ 0.50845194  0.49154806]
 [ 0.50180353  0.49819647]
 [ 0.51570427  0.48429573] 

Dataset description is shown below:

       TOTCHOL    SYSBP    DIABP  CIGPDAY   STROKE
count  200.000  200.000  200.000  200.000  200.000
mean   231.040  144.560   81.400    4.480    1.500
std     42.465   23.754   11.931    9.359    0.501
min    112.000  100.000   51.500    0.000    1.000
25%    204.750  126.750   73.750    0.000    1.000
50%    225.500  141.000   80.000    0.000    1.500
75%    256.250  161.000   90.000    4.000    2.000
max    378.000  225.000  113.000   60.000    2.000

The skewness per column is:

TOTCHOL    0.369
SYSBP      0.610
DIABP      0.273
CIGPDAY    2.618
STROKE     0.000
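
For context, the two summaries above are the standard pandas ones; they were presumably produced with something like the following (the DataFrame name df and the file name are placeholders of mine):

    import pandas as pd

    # Load the cohort; the file name is only a placeholder for illustration
    df = pd.read_csv("stroke_sample.csv")

    cols = ["TOTCHOL", "SYSBP", "DIABP", "CIGPDAY", "STROKE"]

    # Per-column count/mean/std/min/quartiles/max, as in the first table
    print(df[cols].describe().round(3))

    # Per-column skewness, as in the second table (CIGPDAY is strongly right-skewed)
    print(df[cols].skew().round(3))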

Is there a logical explanation for the decreased precision?

Is there a logical explanation for the very-close-to-0.5 probabilities?


1 Answer


I found the answer to my question. I went back to the Python script, to the command that fits the model, i.e.

    LR = LogisticRegression(C=0.1, solver="sag", max_iter=1000).fit(X_train, y_train)

The parameter C had been set to 0.001, which is a very small value (meaning lambda is very high, since C = 1/lambda; C is the inverse regularization strength, and smaller values indicate stronger regularization). Raising it to 0.1, as in the line above, fixed the problem. More on that matter can be found here and here.
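
As a rough illustration of the effect, here is a sketch on synthetic data (not the original cohort): with a very small C the coefficients are shrunk toward zero, so every predicted probability sits near 0.5, while a larger C lets the model actually use the standardized features:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Synthetic, balanced two-class problem standing in for the stroke data
    X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                               n_redundant=0, random_state=0)
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=0)

    for C in (0.001, 1.0):
        lr = LogisticRegression(C=C, solver="sag", max_iter=1000).fit(X_train, y_train)
        # Small C => strong regularization => near-zero coefficients
        # => predict_proba outputs clustered around 0.5
        print(C, np.round(lr.coef_, 3), np.round(lr.predict_proba(X_test[:3]), 3))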
