Solved – R-squared result in linear regression and “unexplained variance”


I did a linear regression in R and got the following result:

                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)        192116.40    6437.27  29.844  < 2e-16 ***
cdd                   272.74      26.94  10.123 1.56e-09 ***
pmax(hdd - 450, 0)     61.73      22.54   2.738   0.0123 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 16500 on 21 degrees of freedom
Multiple R-squared: 0.8454, Adjusted R-squared: 0.8307 
F-statistic: 57.41 on 2 and 21 DF,  p-value: 3.072e-09 
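
The fit came from a call along these lines (the data are listed further down; I'll assume they sit in a data frame called dat):

# model: elec regressed on cdd and a hinge of hdd at 450
fit <- lm(elec ~ cdd + pmax(hdd - 450, 0), data = dat)
summary(fit)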

My question concerns the adjusted R-squared value, 0.83, and what it means if I want to assign approximate percentage contributions to each (monthly) variable.

EDIT:
See the data below. Say I take the first 12 hdd and cdd data points and calculate the sum of the 12 predictions (i.e. the first year's total prediction) using the coefficients above. The baseline (intercept) contribution to the year would be approximately 12 * 192116.40 = 2305397, right? Similarly, the cdd contribution to the year would be approximately 1608 * 272.74 = 438565.9, and the hdd contribution (after my hand-made hinge function) approximately 1329 * 61.73 = 82039.17. Summing the three values yields 2826002, which is within 1.3% of the actual total usage (2862840, the sum of the first 12 elec values).

Can I then say that cdd contributes 438565.9 / 2826002 = 0.1551895, or approximately 16%, of the yearly total? Or do I need to compensate for the adjusted R-squared, i.e. multiply by it: 0.1551895 * 0.8307 = 0.1289159, for approximately 13% of the total? Or is none of this correct reasoning?

My data is:

     elec  hdd cdd
1  235940  880   3
2  205380  772   4
3  211780  551   9
4  192220  281  68
5  221440  165 119
6  304840   15 364
7  283160    4 434
8  300440   11 339
9  272900   42 214
10 204220  322  44
11 201060  592   8
12 229460  784   2
13 214520 1064   0
14 197900  719   2
15 186660  618  15
16 195340  332  88
17 241200  109 159
18 260700   18 282
19 299940   29 367
20 293240    2 426
21 268740   51 159
22 208380  319  36
23 183820  452   7
24 231360  903   0

(The monthly billing cycle for elec can be anywhere from 29 to 32 days, so that injects a lot of variance right there. I do not yet have all of the billing-cycle lengths to do a trading-day kind of adjustment.)
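
For concreteness, here is the arithmetic from the EDIT as an R sketch (assuming the data above are in the data frame dat and the fit object from above):

yr1 <- dat[1:12, ]                                            # first year of data

base.contrib <- 12 * coef(fit)["(Intercept)"]                 # ~ 2305397
cdd.contrib  <- sum(yr1$cdd) * coef(fit)["cdd"]               # 1608 * 272.74 ~ 438566
hdd.contrib  <- sum(pmax(yr1$hdd - 450, 0)) *
                coef(fit)["pmax(hdd - 450, 0)"]               # 1329 * 61.73 ~ 82039

pred.total <- base.contrib + cdd.contrib + hdd.contrib        # ~ 2826002
pred.total / sum(yr1$elec)   # ~ 0.987, i.e. within 1.3% of the actual 2862840
cdd.contrib / pred.total     # ~ 0.155, the cdd share in question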

Best Answer

$R^2$ is the squared correlation of the OLS prediction $\hat{Y}$ and the dependent variable (DV) $Y$. In a multiple regression with three predictors $X_{1}, X_{2}, X_{3}$:

# generate some data
> N  <- 100
> X1 <- rnorm(N, 175, 7)                                 # predictor 1
> X2 <- rnorm(N,  30, 8)                                 # predictor 2
> X3 <- abs(rnorm(N, 60, 30))                            # predictor 3
> Y  <- 0.5*X1 - 0.3*X2 - 0.4*X3 + 10 + rnorm(N, 0, 10)  # DV
> fitX123 <- lm(Y ~ X1 + X2 + X3)  # regression
> summary(fitX123)$r.squared       # R^2
[1] 0.6361916

> Yhat <- fitted(fitX123)          # OLS prediction Yhat
> cor(Yhat, Y)^2
[1] 0.6361916

$R^2$ is also equal to the variance of $\hat{Y}$ divided by the variance of $Y$. In that sense, it is the "variance accounted for by the predictors".

> var(Yhat) / var(Y)
[1] 0.6361916

The squared semi-partial correlation of $Y$ with a predictor $X_{1}$ equals the increase in $R^2$ when $X_{1}$ is added to the regression that already contains all remaining predictors. It can be taken as the unique contribution of $X_{1}$ to the proportion of variance explained by all predictors. Here, the semi-partial correlation is the correlation of $Y$ with the residuals from the regression in which $X_{1}$ is the predicted variable and $X_{2}$ and $X_{3}$ are the predictors.

# residuals from regression with DV X1 and predictors X2, X3
> X1.X23 <- residuals(lm(X1 ~ X2 + X3))
> (spcorYX1.X23 <- cor(Y, X1.X23))   # semi-partial correlation of Y with X1
[1] 0.3172553

> spcorYX1.X23^2                     # squared semi-partial correlation
[1] 0.1006509

> fitX23 <- lm(Y ~ X2 + X3)          # regression with DV Y and predictors X2, X3

# increase in R^2 when changing to full regression
> summary(fitX123)$r.squared - summary(fitX23)$r.squared
[1] 0.1006509
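
Applied to the model in the question, the same decomposition would look something like this (just a sketch using the question's variable names; I have not run it on those data):

# unique contribution of cdd = increase in R^2 over the model without cdd
fitFull <- lm(elec ~ cdd + pmax(hdd - 450, 0), data = dat)
fitHdd  <- lm(elec ~ pmax(hdd - 450, 0), data = dat)
summary(fitFull)$r.squared - summary(fitHdd)$r.squared   # squared semi-partial of cdd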