Linear Regression – Interpreting R^2 with No Variation in Response Variable

correlation, linear model, regression

Suppose I wish to fit $\hat{y} = \beta_0 + \beta_1x$ where the data are as follows:

x = 0.0, 0.1, 0.2, 0.3, 0.4
y = 0.0, 0.0, 0.0, 0.0, 0.0

Clearly, $\hat{\beta_1} = 0$ and $\hat{\beta_0} = 0$. But what is $R^2$ in this instance?

Suppose I calculate:

$$r = \frac{n S_{xy} - S_xS_y}{\sqrt{(nS_{xx} - S_x^2) (nS_{yy} - S_y^2)}}$$

or,

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Then both will be NaN/Undefined since the denominator in both instances will be zero.

So, for this particular dataset, is $R^2$ actually defined? I would tentatively guess that it should be 1, since the data fit the model perfectly.

Best Answer

The following plots are accompanied by their Pearson product-moment correlation coefficients:

[Figure: Pearson correlation for various scatter plots]

If the points lie exactly on an upward-sloping line, the Pearson correlation is +1; if they lie exactly on a downward-sloping line, it is -1. But notice that the horizontal line has an undefined correlation.

At first sight you might expect this to be zero, as a compromise between +1 and -1. You may have thought that since positive correlation means "as one variable increases, the other tends to increase" while negative correlation means "as one variable increases, the other tends to decrease", the fact that $Y$ neither tends to increase nor decrease as $X$ increases means that $r=0$. That idea is correct for the other plots labelled $r=0$, but they all exhibit variation in $Y$.

Correlation is symmetric: the correlation between $X$ and $Y$ is the same as that between $Y$ and $X$. Turning things around, in the $r=0$ plots we see that as $Y$ increases, $X$ neither tends to increase nor decrease. But in our case what happens to $X$ as $Y$ changes? We just don't know! We certainly can't claim (as $r=0$ would imply) that $X$ would neither tend to increase nor decrease. We never got a chance to see it, because $Y$ never varied. Intuitively, there's no way we can determine the correlation from the available data.

More technically, consideration of the formula for PMCC should clarify things:

$$r = \frac{\text{Covariance of X and Y}}{\text{SD of X} \times \text{SD of Y}}$$

where "SD" stands for standard deviation. On a completely horizontal line, the standard deviation of $Y$ is zero because that variable does not vary at all, so the denominator is zero. Since $X$ and $Y$ cannot co-vary, the covariance is also zero, and so is the numerator. Hence the fraction is $\frac{0}{0}$, which is an indeterminate form, and so the correlation coefficient is not defined.
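As a minimal sketch in R, using the five data points from the question (the variable names are illustrative, not part of the original answer), the pieces of that fraction can be computed directly:

```r
# Data from the question: x varies, y is constant.
x <- c(0.0, 0.1, 0.2, 0.3, 0.4)
y <- c(0.0, 0.0, 0.0, 0.0, 0.0)

cov_xy <- cov(x, y)  # covariance: zero, because y never moves
sd_x   <- sd(x)      # positive, because x varies
sd_y   <- sd(y)      # zero, because y is constant

# Assembling the fraction by hand gives 0/0, which R reports as NaN.
r_manual <- cov_xy / (sd_x * sd_y)

# The built-in cor() warns that a standard deviation is zero and returns NA.
r_builtin <- suppressWarnings(cor(x, y))

is.nan(r_manual)
is.na(r_builtin)
```

Both checks at the end should come out `TRUE`: neither route yields a number, which is the software's way of saying the correlation is undefined.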

In a simple linear regression model (only one response and one predictor variable plus an intercept), the coefficient of determination $R^2$ is simply the square of $r$, the PMCC between $X$ and $Y$. Unsurprisingly, this will not be defined either. This is intuitive if we think about $R^2$ as the proportion of variance explained: here the response variable has no variation, so we can explain 0 out of 0 variance, which as a proportion brings us back to the indeterminate form $\frac{0}{0}$.
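The "0 out of 0" arithmetic can be sketched in R with the sums-of-squares form of $R^2$. The constant 3 below is an arbitrary illustrative value, and the fitted values are written down by hand rather than taken from a model fit: with a constant response, the least-squares line is horizontal at that constant.

```r
# A constant response; any constant behaves the same way as 0.
y <- rep(3, 5)

# The least-squares fit to a constant response is a horizontal line
# at the mean of y, so every fitted value equals that mean.
y_hat <- rep(mean(y), length(y))

ss_res <- sum((y - y_hat)^2)    # residual sum of squares: 0
ss_tot <- sum((y - mean(y))^2)  # total sum of squares: 0

# 1 - 0/0 is NaN: there is no variance to explain a proportion of.
r_squared <- 1 - ss_res / ss_tot
is.nan(r_squared)
```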

This conclusion holds true regardless of whether the recorded data are all identically zero, or identically some other number, so long as it would give a horizontal line in a graph of $Y$ against $X$. Note that there may be a difference between the "true" values of $Y$ and those that have been recorded in the data set to the specified level of accuracy. It's possible in a case such as yours that the correct values of $Y$ all round to 0.0 to one decimal place, but if we had access to them to full accuracy, we may be able to observe very small deviations about 0. If that were the case then the actual PMCC and coefficient of determination would both exist, and (i) be approximately equal to zero if the small deviations were just "noise", (ii) be anything up to and including 1 if the small deviations formed an increasing trend indiscernible at the current level of accuracy, or (iii) be anything up to and including $r = -1$ and $R^2 = 1$ if they formed a currently indiscernible downwards trend.
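Cases (ii) and (iii) can be illustrated with a small R sketch. The "true" responses below are hypothetical values invented for illustration (a slope of 0.001 in either direction); nothing in the question says the real data look like this.

```r
x <- c(0.0, 0.1, 0.2, 0.3, 0.4)

# Hypothetical "true" responses with trends too small to survive rounding.
y_up   <-  0.001 * x
y_down <- -0.001 * x

# Recorded to one decimal place, both are indistinguishable from all zeros.
all(round(y_up, 1) == 0)
all(round(y_down, 1) == 0)

# At full accuracy the correlations exist and sit at the two extremes.
r_up   <- cor(x, y_up)    # +1 up to floating-point error
r_down <- cor(x, y_down)  # -1 up to floating-point error

# After rounding, the variation is gone and the correlation is undefined again.
r_rounded <- suppressWarnings(cor(x, round(y_up, 1)))
is.na(r_rounded)
```

Case (i) is not shown because the correlation of pure noise with $X$ depends on the particular random draw, especially with only five points.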

In this answer I have only considered the case of simple linear regression, where the response depends on one explanatory variable. But the argument also applies to multiple regression, where there are several explanatory variables. I'll assume the model includes an intercept term, since dropping the intercept is rarely a good idea, and even in a model without an intercept it's unlikely you want to calculate $R^2$. So long as the intercept is included in the model, $R^2$ is just the square of the multiple correlation coefficient $R$, which is the PMCC between the observed values of the response $Y$ and the values fitted by the model. If $Y$ shows no variation (at least to the recorded accuracy) then the same considerations prevent you from calculating $R$ and hence $R^2$.
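The relationship between $R^2$ and the multiple correlation coefficient can be checked with a short R sketch. The `mtcars` data set and the choice of `wt` and `hp` as predictors are assumptions made purely for illustration; any model with an intercept should behave the same way.

```r
# mtcars ships with R; mpg is the response, wt and hp are two predictors.
fit <- lm(mpg ~ wt + hp, data = mtcars)

r2_reported <- summary(fit)$r.squared

# Multiple correlation coefficient R: the PMCC between observed and fitted values.
R_multiple <- cor(mtcars$mpg, fitted(fit))

# With an intercept in the model, the two agree up to floating-point error.
all.equal(r2_reported, R_multiple^2)

# With a constant response, the least-squares fitted values all equal that
# constant, so both vectors have zero standard deviation and R is undefined.
y_const   <- rep(3, nrow(mtcars))
fit_const <- rep(mean(y_const), nrow(mtcars))
R_const   <- suppressWarnings(cor(y_const, fit_const))
is.na(R_const)
```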

Related Solutions

Solved – Residuals correlated positively with response variable strongly in linear regression

1) Residuals do correlate positively with observed values in many, many cases. Think of it this way - a very large positive error ("error" is the "true residual", to misuse the language) means that the corresponding observation is, all other things equal, likely to be very large in a positive direction. A very large negative error means that the corresponding observation is likely to be very large in a negative direction. If the $R^2$ of the regression is not large, then the variability of the errors will be the dominating effect on the variability of the target variable, and you will see this effect in your plots and correlations.

For example, consider the model $y_i = a + x_i + e_i$, which we'll fit as $y_i = a + bx_i + e_i$ (correct for $b = 1$). Here's the result of a regression with 100 observations:

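# Simulate 100 observations from y = 1 + x + e with standard normal x and e
e <- rnorm(100)
x <- rnorm(100)
y <- 1 + x + e

# Fit the regression and plot the residuals against the observed response
foo <- lm(y~x)
plot(residuals(foo)~y, xlab="y", ylab="Residuals")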

> summary(foo)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3292 -0.8280 -0.0448  0.8213  2.9450 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.8498     0.1288   6.600 2.12e-09 ***
x             0.8929     0.1316   6.787 8.81e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.286 on 98 degrees of freedom
Multiple R-squared: 0.3197, Adjusted R-squared: 0.3128 
F-statistic: 46.06 on 1 and 98 DF,  p-value: 8.813e-10

[Figure: residuals plotted against y for the first model]

Note that we achieved a fairly respectable (in some fields) $R^2$ of 0.32.

We can obscure this effect with a different model:

y <- 1 + 5*x + e

foo <- lm(y~x)
plot(residuals(foo)~y, xlab="y", ylab="Residuals")

which has an $R^2$ of 0.93 and the following residual plot:

[Figure: residuals plotted against y for the second model]

Here the correlation between $y$ and the residuals is about 0.25, but it's a lot less obvious on the plot.

2) Residuals have correlation zero with fitted values in a linear regression, by construction. Is your statement "... weakly correlated with fitted Y negatively" based solely upon looking at the plot, or did you actually calculate the correlation? If the former, appearances can be deceiving... if the latter, something is wrong; possibly you aren't looking at what you think you're looking at.
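Both points can be checked numerically. In a least-squares fit with an intercept, the residuals are uncorrelated with the fitted values, which implies that the correlation between the residuals and $y$ equals $\sqrt{1 - R^2}$. This is consistent with the figures above: $\sqrt{1 - 0.93} \approx 0.26$. The sketch below reuses the first simulated model; the seed is an arbitrary choice for reproducibility, so its numbers will not match the printed output above.

```r
set.seed(1)  # arbitrary seed, chosen only so the sketch is reproducible
e <- rnorm(100)
x <- rnorm(100)
y <- 1 + x + e

fit <- lm(y ~ x)
r2  <- summary(fit)$r.squared

# Point 1: cor(residuals, y) equals sqrt(1 - R^2) for a fit with an intercept.
cor_res_y <- cor(residuals(fit), y)
all.equal(cor_res_y, sqrt(1 - r2))

# Point 2: cor(residuals, fitted) is zero up to floating-point error.
cor_res_fit <- cor(residuals(fit), fitted(fit))
abs(cor_res_fit) < 1e-8
```

So a low $R^2$ forces a strong positive correlation between the residuals and the response, while the correlation with the fitted values stays at zero regardless.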

Solved – Linear regression with strongly non-normal response variable

The distribution of the response is irrelevant. Inference based on small samples requires the errors to be approximately normal (it is better to look at a QQ-plot of the residuals than at their density, because the tails are important). If you are only interested in descriptive results, or if the sample size is not too small, you therefore do not need to worry about normality.

Much more important are the other assumptions of linear regression (correct model structure, no large outliers in the predictors and, if you are interested in inference, homoscedastic and uncorrelated errors).
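The distinction between the response and the errors can be sketched in R. The simulated data are an assumption for illustration only: a right-skewed exponential predictor makes the response strongly non-normal even though the errors are exactly normal.

```r
set.seed(1)  # arbitrary seed for reproducibility
x <- rexp(200)               # strongly right-skewed predictor
y <- 1 + 5 * x + rnorm(200)  # response inherits the skew; errors are normal

fit <- lm(y ~ x)

# The response itself is far from normal...
hist(y, main = "Response", xlab = "y")

# ...but the check that matters for inference is on the residuals.
qqnorm(residuals(fit))
qqline(residuals(fit))
```

If the residual points follow the reference line reasonably well, including in the tails, the normality assumption on the errors is plausible even though the histogram of $y$ is clearly skewed.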
