Solved – way to calculate R-squared in OLS without computing the coefficients

least-squares, r-squared, regression, white-test

The background of my question is that for, e.g., the White heteroskedasticity test or the Breusch-Godfrey (LM) autocorrelation test, we are generally only interested in the R-squared of the "auxiliary" regression. However, the only way of computing that R-squared that I am aware of involves first computing the coefficients and fitted values. This can consume a lot of time when the number of regressors is large, because the matrix that needs to be inverted is correspondingly large (in the case of the White test, the squared residuals are regressed on the independent variables, their squares and their cross-products, so the number of regressors grows quadratically with the number of independent variables).
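
For concreteness, here is a rough sketch (with purely illustrative data and names; e2 merely stands in for the squared OLS residuals) of the auxiliary design the White test requires; only the R-squared of this auxiliary fit enters the test statistic $nR^2$:

    # Sketch of the White-test auxiliary design matrix (illustrative only)
    set.seed(1)
    n <- 100; k <- 5
    X  <- matrix(rnorm(n * k), n, k)      # original regressors
    e2 <- rnorm(n)^2                      # stand-in for the squared OLS residuals

    # levels, squares, and all pairwise cross-products of the regressors
    crossprods <- combn(k, 2, function(idx) X[, idx[1]] * X[, idx[2]])
    Z <- cbind(X, X^2, crossprods)
    ncol(Z)                               # already 20 columns for k = 5

    aux <- lm(e2 ~ Z)                     # the expensive auxiliary regression
    n * summary(aux)$r.squared            # White statistic: n times R-squared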

Is there an "alternative way" to calculate (or perhaps approximate) R-squared?

(I know that the problem could be avoided by using different tests – e.g. Breusch-Pagan instead of White for heteroskedasticity, or Durbin-Watson instead of Breusch-Godfrey. However, I am interested in this question both for the fun of it and because these alternatives can be inferior to the tests mentioned at the beginning.)

Best Answer

No, given a multiple regression, there is no way to compute R-squared while avoiding the bulk of the other computations. You can certainly avoid computing the coefficients themselves, but the main work of the computation still needs to be done.

Note however that no matrix is ever inverted during a linear regression if the computation is done properly. There are many answers on this site that explain this; see, for example, Residual Sum of squares in Weighted regression.
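
As an illustration (the data and names below are made up), the textbook normal-equations formula explicitly inverts $X^\top X$, whereas a QR-based solve such as qr.coef() obtains the same coefficients without ever forming an inverse:

    # Sketch: same OLS coefficients with and without an explicit matrix inverse
    set.seed(1)
    n <- 50
    X <- cbind(1, rnorm(n), rnorm(n))              # design matrix with intercept
    y <- rnorm(n)

    beta_inv <- solve(t(X) %*% X) %*% t(X) %*% y   # explicit inverse (avoid)
    beta_qr  <- qr.coef(qr(X), y)                  # QR-based solve, no inverse
    all.equal(drop(beta_inv), unname(beta_qr))     # should be TRUE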

Here is what might be the minimum possible computation to get R-squared. You have to somehow orthogonalize $y$ with respect to the regression covariates, and the QR decomposition is the most commonly used method of doing that. Let's assume we have a $y$ vector of 10 observations:

    > y <- rnorm(10)

and an $X$ matrix with 2 predictors:

    > x1 <- rnorm(10)
    > x2 <- rnorm(10)

The quickest way to get R-squared is as follows. First, mean-correct each variable:

    > y.c <- y-mean(y)
    > x1.c <- x1-mean(x1)
    > x2.c <- x2-mean(x2)

Then compute a QR matrix decomposition for $X$ and $y$ together:

    > QR <- qr( cbind(x1.c, x2.c, y.c) )

Then R-squared is one minus the proportion of the total sum of squares that remains unexplained:

    > Rsquared <- 1 - QR$qr[3,3]^2 / sum(y.c^2)
    > Rsquared
          y.c
    0.3266491
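
To see why this works: the QR decomposition orthogonalizes the last column, y.c, against x1.c and x2.c, so the diagonal entry QR$qr[3,3] is, up to sign, the length of the part of y.c left over after that projection, i.e. the residual norm. Hence $R_{33}^2$ is the residual sum of squares and $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS} = 1 - R_{33}^2 / \sum_i (y_i - \bar{y})^2$.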

We can confirm that this is correct:

    > fit <- lm(y ~ x1+x2)
    > summary(fit)
    
    Call:
    lm(formula = y ~ x1 + x2)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -2.44213 -0.47947  0.08121  0.89085  1.54395 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   0.5032     0.4761   1.057    0.326
    x1            0.5330     0.4118   1.294    0.237
    x2           -0.6153     0.4215  -1.460    0.188
    
    Residual standard error: 1.323 on 7 degrees of freedom
    Multiple R-squared:  0.3266,    Adjusted R-squared:  0.1343 
    F-statistic: 1.698 on 2 and 7 DF,  p-value: 0.2505

By the way, if you don't want to bother de-meaning the x-variables, then you can compute the QR decomposition using the entire design matrix including the intercept column (the denominator is still the centered total sum of squares, so y.c is reused):

    > QR <- qr( cbind(1, x1, x2, y) )
    > Rsquared <- 1 - QR$qr[4,4]^2 / sum(y.c^2)

This gives the same result because the QR decomposition orthogonalizes each column in succession with respect to the previous columns and de-meaning simply orthogonalizes all the columns with respect to the constant vector.
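
To package this into something reusable, here is a small sketch (the function name rsq_qr is mine, not from any package; it assumes the regressors are not collinear and that there are more observations than columns):

    # Sketch: R-squared from a single QR decomposition, without the coefficients
    rsq_qr <- function(y, X) {
      y.c <- y - mean(y)                               # centered response
      QR  <- qr(cbind(scale(X, scale = FALSE), y.c))   # centered X columns, then y
      p   <- ncol(QR$qr)
      1 - QR$qr[p, p]^2 / sum(y.c^2)                   # 1 - RSS/TSS
    }

    rsq_qr(y, cbind(x1, x2))   # matches the Multiple R-squared from summary(fit)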
