Solved – How to define R squared on a subset of the original sample

anova, linear model, r-squared, regression

Suppose I fit a linear model

$$y = X\beta + \epsilon$$

and I can derive the coefficient of determination, also known as the R squared, by calculating

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

where

$$SS_{res}=\sum_i e_i^2$$ and

$$SS_{tot}=\sum_i (y_i-\bar{y})^2$$

Now I also want to know how well the $e>0$ side and the $e<0$ side fit, respectively. Is there a measure of the "R squared" of a linear model on a subset of the original sample that also lies in $[0,1]$?

I notice there's a concept called partial R squared, but it measures the contribution of a subset of the regressors rather than the fit on a sub-sample.

Best Answer

$$R^2 = \dfrac{SSTot-SSRes}{SSTot}$$

Let's break down what those mean.

$$SSTot = \sum_i (y_i - \bar{y})^2$$

The regression that you're doing wants to predict the mean of the distribution of your response variable conditioned on some predictors. In the absence of knowing anything about how your data are generated, why not guess the overall mean? $SSTot$ is the total sum of squares and measures your error when you use the overall mean of $y$ as the prediction, no matter what predictors you have. This may be a naive approach, but it's a good baseline.

$$SSRes = \sum_i (y_i - \hat{y}_i)^2$$

However, now that you've run your regression, you think you have more insight than you did when you were just guessing the overall mean of all $y$ values. Now see how much error you have when you use your predictions from the regression!
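To make the two sums concrete, here is a minimal numpy sketch on simulated data (the data-generating process and sample size are my own assumptions, purely for illustration). It fits the regression by least squares, then computes $SSTot$, $SSRes$, and $R^2$ exactly as defined above.

```python
import numpy as np

# Hypothetical data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, size=100)

# Fit y = X beta by ordinary least squares (intercept + slope).
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

ss_tot = np.sum((y - y.mean()) ** 2)   # error when naively guessing the overall mean
ss_res = np.sum((y - y_hat) ** 2)      # error when using the regression predictions
r2 = (ss_tot - ss_res) / ss_tot        # equivalently 1 - ss_res / ss_tot

print(round(r2, 4))
```

With an intercept in the model, $SSRes \le SSTot$ is guaranteed, so this $R^2$ always lands in $[0,1]$.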

With those two values calculated, you can do the arithmetic to find $R^2$. Now, you want to do it on a subset of the data. I see two options; let $j$ index the observations in your subset.

1) $\dfrac{\sum_j (y_j - \bar{y})^2 - \sum_j (y_j - \hat{y}_j)^2}{\sum_j (y_j - \bar{y})^2}$

2) $\dfrac{\sum_j (y_j - \bar{y}_{subset})^2 - \sum_j (y_j - \hat{y}_j)^2}{\sum_j (y_j - \bar{y}_{subset})^2}$

The first option uses the same average value as you get when you look at the whole data set, while the second computes the average of your subset. I think I can squint and see a reason to do option #2, but I wouldn't do it. $R^2$ is a way of measuring how you do compared to naively guessing the overall mean, so I'd want to see how each subset does compared to guessing the overall mean.
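Both options are easy to compute once you have the fitted values. Here is a numpy sketch (again on simulated data of my own invention) that splits the sample by residual sign, as in the question, and evaluates option #1 (overall mean) and option #2 (subset mean) on each side:

```python
import numpy as np

# Hypothetical data and fit, for illustration only.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, size=200)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
resid = y - y_hat

overall_mean = y.mean()
for label, mask in [("e > 0", resid > 0), ("e < 0", resid < 0)]:
    y_s, yhat_s = y[mask], y_hat[mask]
    ss_res = np.sum((y_s - yhat_s) ** 2)

    # Option 1: baseline is the overall mean of y from the full sample.
    ss_tot_1 = np.sum((y_s - overall_mean) ** 2)
    r2_1 = (ss_tot_1 - ss_res) / ss_tot_1

    # Option 2: baseline is the subset's own mean.
    ss_tot_2 = np.sum((y_s - y_s.mean()) ** 2)
    r2_2 = (ss_tot_2 - ss_res) / ss_tot_2

    print(label, round(r2_1, 3), round(r2_2, 3))
```

One caveat: unlike the full-sample $R^2$, neither subset version is guaranteed to be non-negative, since nothing forces $SSRes$ on the subset to be smaller than either baseline sum of squares; both are bounded above by 1 but can drop below 0 for a badly fitting subset.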

Edit: Thinking about it more, I completely reject option #2. If you want to compare to the mean of the subset, rerun the regression on just the subset and calculate $R^2$ the usual way; but then you're no longer using the same regression equation, or even the same set of significant parameters, so it's a totally different problem.
