
I am using a raw dataset with 4 feature variables (Total Cholesterol, Systolic Blood Pressure, Diastolic Blood Pressure, and Cigarettes per Day) to do a binary classification (predicting stroke likelihood) with the Logistic Regression algorithm.

I made sure that the class counts are balanced, i.e., there is an equal number of occurrences per class.
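
(A quick way to verify that balance, assuming the labels live in a pandas Series named y, which is an assumption on my part, is something like:

    import pandas as pd

    y = pd.Series([1, 2, 1, 2, 2, 1])   # placeholder labels: 1 = no stroke, 2 = stroke
    print(y.value_counts())             # equal counts per class => balanced set

)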

Using Python + sklearn, the problem is that the classification performance degrades sharply when I normalize the dataset using

    X = preprocessing.StandardScaler().fit(X).transform(X)

or

    X = preprocessing.MinMaxScaler().fit_transform(X)
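
For reference, this is roughly what the two scalers compute; a minimal sketch on made-up numbers (not the actual cohort), just to show how the feature magnitudes change:

    import numpy as np
    from sklearn import preprocessing

    # Toy feature matrix: two columns with very different magnitudes
    X_toy = np.array([[200.0,  0.0],
                      [240.0, 20.0],
                      [280.0, 40.0]])

    # StandardScaler: each column ends up with mean 0 and unit standard deviation
    print(preprocessing.StandardScaler().fit_transform(X_toy))

    # MinMaxScaler: each column is rescaled to the [0, 1] range
    print(preprocessing.MinMaxScaler().fit_transform(X_toy))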

So before normalizing the dataset:

         precision    recall  f1-score   support

      1       0.70      0.72      0.71        29
      2       0.73      0.71      0.72        31

avg / total   0.72      0.72      0.72        60

while after normalizing the dataset, the precision of class 1 decreased significantly:

         precision    recall  f1-score   support

      1       0.55      0.97      0.70        29
      2       0.89      0.26      0.40        31

 avg / total  0.72      0.60      0.55        60

Another observation I could not explain is the predicted probability of each class.

Before the normalization:

 [ 0.17029846  0.82970154]
 [ 0.47796534  0.52203466]
 [ 0.45997593  0.54002407]
 [ 0.54532438  0.45467562]
 [ 0.45999462  0.54000538]

After the normalization (for the same test set entries):

 [ 0.50033247  0.49966753]
 [ 0.50042371  0.49957629]
 [ 0.50845194  0.49154806]
 [ 0.50180353  0.49819647]
 [ 0.51570427  0.48429573] 

Dataset description is shown below:

       TOTCHOL    SYSBP    DIABP  CIGPDAY   STROKE
count  200.000  200.000  200.000  200.000  200.000
mean   231.040  144.560   81.400    4.480    1.500
std     42.465   23.754   11.931    9.359    0.501
min    112.000  100.000   51.500    0.000    1.000
25%    204.750  126.750   73.750    0.000    1.000
50%    225.500  141.000   80.000    0.000    1.500
75%    256.250  161.000   90.000    4.000    2.000
max    378.000  225.000  113.000   60.000    2.000

The skewness per column is:

TOTCHOL    0.369
SYSBP      0.610
DIABP      0.273
CIGPDAY    2.618
STROKE     0.000
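
For context, the two summaries above are the standard pandas ones; they were presumably produced with something like the following (the DataFrame name df and the file name are placeholders of mine):

    import pandas as pd

    # Load the cohort; the file name is only a placeholder for illustration
    df = pd.read_csv("stroke_sample.csv")

    cols = ["TOTCHOL", "SYSBP", "DIABP", "CIGPDAY", "STROKE"]

    # Per-column count/mean/std/min/quartiles/max, as in the first table
    print(df[cols].describe().round(3))

    # Per-column skewness, as in the second table (CIGPDAY is strongly right-skewed)
    print(df[cols].skew().round(3))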

Is there a logical explanation for the decreased precision?

Is there a logical explanation for the very-close-to-0.5 probabilities?


1 Answer


I found the answer to my question. I went back to the Python script, to the command that fits the model, i.e.

    LR = LogisticRegression(C=0.1, solver="sag", max_iter=1000).fit(X_train, y_train)

The parameter C had been set to 0.001, which is a very small value (meaning lambda is very high, since C = 1/lambda; C is the inverse regularization strength, and smaller values indicate stronger regularization). Raising it to 0.1, as in the line above, fixed the problem. More on that matter can be found here and here.
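
As a rough illustration of the effect, here is a sketch on synthetic data (not the original cohort): with a very small C the coefficients are shrunk toward zero, so every predicted probability sits near 0.5, while a larger C lets the model actually use the standardized features:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Synthetic, balanced two-class problem standing in for the stroke data
    X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                               n_redundant=0, random_state=0)
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=0)

    for C in (0.001, 1.0):
        lr = LogisticRegression(C=C, solver="sag", max_iter=1000).fit(X_train, y_train)
        # Small C => strong regularization => near-zero coefficients
        # => predict_proba outputs clustered around 0.5
        print(C, np.round(lr.coef_, 3), np.round(lr.predict_proba(X_test[:3]), 3))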
