
I'm learning machine learning by looking through other people's kernels on Kaggle, specifically this Mushroom Classification kernel.

The author first applied PCA to the transformed indicator matrix, keeping only 2 principal components for later visualization. I then checked how much variance those two components retain, and found it is only about 16.6%:

In [18]: pca.explained_variance_ratio_.cumsum()
Out[18]: array([0.09412961, 0.16600686])

But the test result, with about 90% accuracy, suggests the model works well.

If variance stands for information, then how can the ML model work well when so much information is lost?
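To reproduce the situation, here is a minimal sketch of the same kind of pipeline on synthetic categorical data (the Mushroom dataset itself is not loaded; the two features and the label rule below are made up for illustration, and scikit-learn is assumed):

```python
# Sketch: one-hot encode categorical features, project to 2 principal
# components, then train/test entirely in that 2-D space.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
f1 = rng.integers(0, 4, n)          # categorical feature with 4 levels
f2 = rng.integers(0, 6, n)          # categorical feature with 6 levels
y = (f1 >= 2).astype(int)           # illustrative label rule
X = np.column_stack([f1, f2])

# One-hot (indicator) matrix, then keep only 2 principal components.
X_ind = OneHotEncoder().fit_transform(X).toarray()
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_ind)
cum = pca.explained_variance_ratio_.cumsum()
print(cum)                          # well below 1.0: most variance discarded

# Both the train and test sets live in the same 2-D projected space.
X_tr, X_te, y_tr, y_te = train_test_split(X_2d, y, random_state=0)
score = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
print(score)
```

As in the kernel, the accuracy is evaluated on data that was projected with the same PCA, so a low retained-variance ratio does not by itself cap the test score.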

nbro
Bicheng

1 Answer


Because both Xtrain and Xtest are projected onto the space spanned by the two selected principal components, the 90% accuracy is measured in that 2-D projected space.

Also, the idea that the explained-variance ratio in PCA measures information content depends on the distribution of the data; it is not true in general.
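A toy example (a sketch assuming scikit-learn, with synthetic data) makes this concrete: the top principal component can capture almost all the variance while carrying almost no label information.

```python
# The class-separating direction has low variance, so PCA's top
# component keeps the variance but discards the label signal.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)
x0 = rng.normal(0, 10.0, n)         # high variance, independent of the label
x1 = y + rng.normal(0, 0.1, n)      # low variance, fully label-driven
X = np.column_stack([x0, x1])

pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # close to 1: almost all the variance
acc = LogisticRegression().fit(X_1d, y).score(X_1d, y)
print(acc)                            # near chance level despite that
```

Conversely, a projection retaining little variance can still preserve the discriminative directions, which is one way a model can score well after an apparently lossy PCA step.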

OmG