cross_val_score (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) uses the estimator’s default scorer (if available), and LinearRegression (the estimator I use - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) uses the coefficient of determination, defined as $R^2 = 1 - \frac{u}{v}$, where $u$ is the residual sum of squares ((y_true - y_pred) ** 2).sum() and $v$ is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a score of 0.0.
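To make that definition concrete, here is a minimal sketch (y_true and y_pred are made-up toy arrays, not from my data) that reproduces the formula and checks it against sklearn.metrics.r2_score, which is what the default R² scoring is based on:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.9])

# u: residual sum of squares
u = ((y_true - y_pred) ** 2).sum()
# v: total sum of squares, using the mean of the y_true array passed in
v = ((y_true - y_true.mean()) ** 2).sum()

manual_r2 = 1 - u / v
assert np.isclose(manual_r2, r2_score(y_true, y_pred))
```

Note that the mean in $v$ is computed from whatever y_true array the scorer receives.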
Is the y_true.mean here the mean of the training set or of the testing set? If it is the testing-set mean, isn't that "cheating", i.e. we compare our predictions against a baseline that has inferred something from the testing set?
So doing better than a baseline wouldn't mean $R^2 > 0$ but rather $R^2 > -0.01$ or something like that?
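As a toy sketch of what I mean (y_train and y_test are made-up arrays, chosen so their means differ): a constant baseline that predicts the training-set mean scores below 0 on the test set, precisely because $v$ is computed from the test set's own mean:

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets whose means differ between train (2.0) and test (3.0).
y_train = np.array([1.0, 2.0, 3.0])
y_test = np.array([2.0, 3.0, 4.0])

# A constant baseline: predict the training-set mean for every test point.
baseline_pred = np.full_like(y_test, y_train.mean())

# u = 5.0, v = 2.0 (v uses the test-set mean), so R^2 = 1 - 5/2 = -1.5
print(r2_score(y_test, baseline_pred))  # -1.5
```

So a legitimate baseline (one fitted only on the training data) already lands below 0 here, which is what makes me think the threshold for "better than a baseline" is slightly negative rather than exactly 0.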