
`cross_val_score` (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) uses the estimator's default scorer (if available), and `LinearRegression` (the estimator I use: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) uses the coefficient of determination. It is defined as $R^2 = 1 - \frac{u}{v}$, where $u$ is the residual sum of squares `((y_true - y_pred) ** 2).sum()` and $v$ is the total sum of squares `((y_true - y_true.mean()) ** 2).sum()`. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of `y`, disregarding the input features, would get a score of 0.0.
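For reference, the quoted formula can be checked directly against `sklearn.metrics.r2_score` (the values here are just illustrative):

```python
# Sanity check: the manual R^2 formula matches sklearn.metrics.r2_score.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

u = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
print(1 - u / v)                            # ~0.9486
print(r2_score(y_true, y_pred))             # same value
```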

Is `y_true.mean` the mean of the training set or of the testing set? If it's the testing set's mean, isn't it "cheating", i.e. aren't we comparing our predictions to a method that has inferred something from the testing set?

So doing better than a baseline wouldn't be having $R^2 > 0$ but rather $R^2 > -0.01$ or something like this?
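To make this concrete, here is a quick sketch with synthetic data: fit scikit-learn's `DummyRegressor` (the constant-mean baseline) on the training set and score it on the test set. Its test $R^2$ is typically slightly below 0, because the training mean is not exactly the test mean:

```python
# Sketch with synthetic data: a constant model predicting the TRAINING mean
# usually scores slightly below 0 on the test set, because R^2's baseline (v)
# is built from the TEST set's own mean.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # close to 0, usually slightly negative
```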

  • You have `model.score(x_test, y_test)`. Then it computes $R^2$ according to $\hat{y}=model(x_\text{test})$. $u$ accounts for the predictions, whereas $v$ represents an ideal situation in which $\hat{y}$ is the same as `y_test` (i.e. no errors). You do the ratio of that, so what's the issue? – Luca Anzalone May 31 '23 at 20:14
  • From my understanding $v$ represents the situation where I predict the mean, not a situation with no errors? (otherwise $v$ would be $0$ and you'd be dividing by $0$) – FluidMechanics Potential Flows May 31 '23 at 21:30

1 Answer


Let's clarify this with a numerical example. Assume `y_true = [0, 1, 1]` are the true class labels of $N=3$ test points; therefore `y_true.mean() = 2/3`.

Case 1: everything is wrongly predicted, so `y_pred = [1, 0, 0]`. Let's compute $u$ and $v$. $$ \begin{align} u &= \sum(y_\text{true}-y_\text{pred})^2 = (0-1)^2+(1-0)^2 +(1-0)^2 = 3 \\ v &= \sum \Big(y_\text{true}-\frac23\Big)^2 = (-2/3)^2+(1/3)^2+(1/3)^2 = \frac69 = \frac23 \end{align}$$ Now we compute $R^2$: $$R^2 = 1 - \frac{u}{v} = 1 - \frac{3}{2/3} = 1 - \frac92 = \mathbf{-3.5}$$ The score is strongly negative because `y_pred` is always wrong. So you can think of $u$ as counting the number of errors (exactly so, if you predict each label to be exactly zero or one), whereas $v$ here happens to equal the fraction of positive labels.

Case 2: perfect predictions, so `y_pred = y_true`. $$ \begin{align} u &= (0-0)^2+(1-1)^2 +(1-1)^2 = 0\quad\quad\text{(no errors)} \\ R^2 &= 1 -\frac{0}{2/3} = 1 \end{align}$$ $R^2=1$ because every label is correctly predicted: there are no errors, so the score is the best possible.

Case 3: some errors, say `y_pred = [0, 1, 0]`. Again: $$ \begin{align} u &= (0-0)^2+(1-1)^2 +(1-0)^2 = 1 \\ R^2 &= 1 -\frac{1}{2/3} = 1 - \frac32 = -\frac12 \end{align}$$ $R^2=-0.5$: a negative score, i.e. worse than the constant model that always predicts the mean. Note that each wrongly predicted label adds 1 to $u$, so every error lowers the score by $1/v = 3/2$: one error gives $-0.5$, three errors give $-3.5$.
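We can verify all three cases with `sklearn.metrics.r2_score` (same numbers as above):

```python
# Verifying the three cases with sklearn's r2_score.
from sklearn.metrics import r2_score

y_true = [0, 1, 1]
print(r2_score(y_true, [1, 0, 0]))  # Case 1: -3.5 (all wrong)
print(r2_score(y_true, [0, 1, 1]))  # Case 2:  1.0 (all correct)
print(r2_score(y_true, [0, 1, 0]))  # Case 3: -0.5 (one error)
```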

So, there is no issue with using the mean of the true labels, because that information is not used during training but only to compute the evaluation metric.

Luca Anzalone
  • Maybe my post wasn't clear but I'm talking about `cross_val_score` which evaluates the coefficient of determination on the testing set? – FluidMechanics Potential Flows Jun 01 '23 at 22:48
  • I see. `cross_val_score` takes a model and a dataset ($X$, $y$). It trains the model $k$ times, each time leaves out (for validation) a "fold", and then computes the score (e.g. $R^2$) on that fold. You use `cross_val_score` to validate your model, so $(X, y)$ is actually the **training set** not the test set (otherwise you "cheat"). Once you determined the best model you retrain it on all the train-set, and do the final evaluation on test (e.g. by computing $R^2$). Read [here](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) for more. – Luca Anzalone Jun 02 '23 at 10:12
  • I see. But when computing the score (e.g. $R^2$), it does use `y_true.mean` in `v`, right? `y_true.mean` being the mean of the testing fold. – FluidMechanics Potential Flows Jun 02 '23 at 15:18
  • You perform `cross_val_score` on $(X_\text{train},y_\text{train})$, so for each fold (there are $k$ of them) you have $(X_\text{train}^k,y_\text{train}^k)$ and $(X_\text{val}^k,y_\text{val}^k)$. Evaluation is performed on the latter pair, and `y_true.mean` refers to $y_\text{val}^k$. You do this, see the results and, if satisfied, you finally evaluate on the test-set: you don't use `cross_val_score`, otherwise it would train on test data, which is wrong; instead use [r2_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score) (see the sketch after these comments). – Luca Anzalone Jun 02 '23 at 18:00
  • I see, you don't use `cross_val_score` to get the coefficient of determination of your method; to get it, you calculate it on the test-set. However, I believe my questions still stand. The `y_true.mean` is now the mean of $y_\text{test}$, right? So when calculating the coefficient of determination on the test-set, `v` depends on the mean of the test set. So if $R^2_\text{test}$ equals 0, it means we do as well as a method that has inferred something (its mean) from the test set (`y_true.mean`), so we learned a little more than just noise? – FluidMechanics Potential Flows Jun 03 '23 at 13:52
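A minimal sketch of the workflow described in the comments above (synthetic data, with `LinearRegression` as in the question): cross-validate on the training data only, then compute one final $R^2$ on the held-out test set.

```python
# Workflow sketch: cross_val_score on the TRAINING data only,
# then a single final R^2 evaluation on the test set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression()
# Each fold's R^2 uses the mean of that fold's held-out y values in v.
print(cross_val_score(model, X_train, y_train, cv=5))

model.fit(X_train, y_train)                     # retrain on the full training set
print(r2_score(y_test, model.predict(X_test)))  # final score; v uses y_test.mean()
```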