Regression – Interpreting High $R^2$ and High $p$-Value in Simple Linear Regression

linear modelp-valuer-squaredregression

Let's assume that we have simple linear regression:
$\hat{y} = bx + \text{intercept}$.

Is it possible to have a high p-value and high $R^2$ (or low p-value and low $R^2$)? I've been looking for examples of this. When the linear regression has multiple parameters, I saw some examples where p-value for some parameters are low, but overall $R^2$ is low as well, but I was wondering if it's possible for the linear regression of a single parameter.

Best Answer

Yes, it is possible. The $R^2$ and the $t$ statistic (used to compute the p-value) are related exactly by:

$ |t| = \sqrt{\frac{R^2}{(1- R^2)}(n -2)} $

Therefore, you can have a high $R^2$ with a high p-value (a low $|t|$) if you have a small sample.

For instance, take $n = 3$. For this sample size to give you a (two-sided) p-value less then 10% you would need an $R^2$ greater than 85% -- anything less than that would give you "non-significant" p-value.

As a concrete example, the simulation below produces an $R^2$ close to 0.5 with a p-value of $0.516$.

set.seed(10)
n <- 3
x <- rnorm(n, 0, 1)
y <- 1 + x + rnorm(n, 0, 1)
summary(m1 <- lm(y ~ x))

Call:
lm(formula = y ~ x)

Residuals:
       1        2        3 
-0.36552  0.42802 -0.06251 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.7756     0.4261    1.82    0.320
x             0.5065     0.5333    0.95    0.516

Residual standard error: 0.5663 on 1 degrees of freedom
Multiple R-squared:  0.4743,    Adjusted R-squared:  -0.05148 
F-statistic: 0.9021 on 1 and 1 DF,  p-value: 0.5164

For the opposite case (low p-value with low $R^2$), you can trivially obtain that by setting a regression where $x$ has a low explanatory power and let $n \to \infty$ to get a p-value as small as you want.

Related Solutions

R Linear Regression – Categorical Variable “Hidden” Value in Linear Regression

Q: " ... how do I interpret the x2 value "High"? For example, what effect does "High" x2s have on the response variable in the example given here??

A: You have no doubt noticed that there is no mention of x2="High" in the output. At the moment x2High is chosen as the "base case". That's because you offered a factor variable with the default coding for levels despite an ordering that would have been L/M/H more naturally to the human mind. But "H" being lexically before both "L" and "M" in the alphabet, was chosen by R as the base case.

Since 'x2' was not ordered, each of the reported contrasts were relative to x2="High" and so x2=="Low" was estimated at -0.78 relative to x2="High". At the moment the Intercept is the estimated value of "Y" when x2="High" and x1= 0. You probably want to re-run your regression after changing the levels ordering (but not making the factor ordered).

x2a = factor(x2, levels=c("Low", "Medium", "High"))

Then your 'Medium' and 'High' estimate will be more in line with what you expect.

Edit: There are alternative coding arrangements (or more accurately arrangements of the model matrix.) The default choice for contrasts in R is "treatment contrasts" which specifies one factor level (or one particular combination of factor levels) as the reference level and reports estimated mean differences for other levels or combinations. You can, however have the reference level be the overall mean by forcing the Intercept to be 0 (not recommended) or using one of the other contrast choices:

?contrasts
?C   # which also means you should _not_ use either "c" or "C" as variable names.

You can choose different contrasts for different factors, although doing so would seem to impose an additional interpretive burden. S-Plus uses Helmert contrasts by default, and SAS uses treatment contrasts but chooses the last factor level rather than the first as the reference level.

Solved – the relationship between R-squared and p-value in a regression

The answer is no, there is no such regular relationship between $R^2$ and the overall regression p-value, because $R^2$ depends as much on the variance of the independent variables as it does on the variance of the residuals (to which it is inversely proportional), and you are free to change the variance of the independent variables by arbitrary amounts.

As an example, consider any set of multivariate data $((x_{i1}, x_{i2}, \ldots, x_{ip}, y_i))$ with $i$ indexing the cases and suppose that the set of values of the first independent variable, $\{x_{i1}\}$, has a unique maximum $x^*$ separated from the second-highest value by a positive amount $\epsilon$. Apply a non-linear transformation of the first variable that sends all values less than $x^* - \epsilon/2$ to the range $[0,1]$ and sends $x^*$ itself to some large value $M \gg 1$. For any such $M$ this can be done by a suitable (scaled) Box-Cox transformation $x \to a((x-x_0)^\lambda - 1)/(\lambda-1))$, for instance, so we're not talking about anything strange or "pathological." Then, as $M$ grows arbitrarily large, $R^2$ approaches $1$ as closely as you please, regardless of how bad the fit is, because the variance of the residuals will be bounded while the variance of the first independent variable is asymptotically proportional to $M^2$.

You should instead be using goodness of fit tests (among other techniques) to select an appropriate model in your exploration: you ought to be concerned about the linearity of the fit and of the homoscedasticity of the residuals. And don't take any p-values from the resulting regression on trust: they will end up being almost meaningless after you have gone through this exercise, because their interpretation assumes the choice of expressing the independent variables did not depend on the values of the dependent variable at all, which is very much not the case here.

Best Answer

Related Solutions

R Linear Regression – Categorical Variable “Hidden” Value in Linear Regression

Solved – the relationship between R-squared and p-value in a regression

Related Question