Regression – Calculating Standard Error of Regression Coefficient Without Raw Data

Tags: descriptive-statistics, regression, self-study, standard-error

After searching here:

Perform simple regression without raw data

I am still curious about this.

Is it possible to derive the standard error of a regression coefficient from summary data alone?

E.g., assume we are given the following variance-covariance matrix.

$\begin{bmatrix}Var(X) & Cov(X,Y)\\Cov(X,Y) & Var(Y)\end{bmatrix}$

We can derive the regression coefficient $\beta_{XY} = Cov(X,Y) / Var(X)$.
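This identity is easy to verify numerically in R. A minimal sketch on simulated data (the model and all numbers here are purely illustrative):

set.seed(42)
X <- rnorm(50)
Y <- 1 + 3 * X + rnorm(50)      # Arbitrary illustrative model
c(From.summary = cov(X, Y) / var(X),
  From.lm      = unname(coef(lm(Y ~ X))["X"]))

Both entries agree, as they must.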

Given a specific $n$, is it possible to derive the standard error of $\beta$ as well? If so, which formula is being used? All the formulas for regression standard errors that I could find assume you know the variance of the residuals of the regression, which we don't know from summary data alone.

—UPDATE—
Based on feedback by Greg and whuber, I am now at a point where I know that

$\sigma^2(b) = \sigma^2(X'X)^{-1}=\sigma^2(Var(X)(n-1))^{-1}$

(using that, for $X$ centered at its mean, $X'X = \sum_i (x_i-\bar{x})^2 = Var(X)(n-1)$), but matrix formulas for the standard error of $b$ typically then note that $\sigma^2$, the unknown variance of the errors, is estimated with the MSE.

Based on comments by whuber, the residual sum of squares is:

$Y'Y - b\,X'Y$, which according to Greg can be spelled out as

$Var(Y)(n-1) - b\,Cov(X,Y)(n-1)$

And to go from here to the MSE, I think (?) all that is left is to divide by $n-2$,

so
$(Var(Y)(n-1) - b\,Cov(X,Y)(n-1)) / (n-2)$

And plugging this all in yields:

$\sigma^2(b) =\big((Var(Y)(n-1) - b\,Cov(X,Y)(n-1)) / (n-2)\big)\times(Var(X)(n-1))^{-1}$

I will try to see if this works out in R (a first attempt is below), and will report back.
If anyone sees something blatantly wrong, I appreciate the heads up.
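Here is a minimal sketch of that check for simple regression, on simulated data (the model and sample size are purely illustrative):

set.seed(1)
n <- 30
X <- rnorm(n)
Y <- 2 * X + rnorm(n)           # Illustrative model
#
# Pretend only these summaries (and n) are known.
#
vX <- var(X); vY <- var(Y); cXY <- cov(X, Y)
b   <- cXY / vX                                 # Slope estimate
mse <- (vY*(n-1) - b*cXY*(n-1)) / (n-2)         # Estimated error variance
se  <- sqrt(mse / (vX*(n-1)))                   # SE of the slope
#
# Compare with lm on the raw data.
#
fit <- summary(lm(Y ~ X))
c(From.summary=se, From.lm=fit$coefficients["X", "Std. Error"])

The two values agree exactly, as the algebra above says they should.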

Best Answer

With the usual notation, arrange the independent variables by columns in an $n\times (p+1)$ array $X$, with one of them filled with the constant $1$, and arrange the dependent values $Y$ into an $n$-vector (also a column). Assuming the appropriate inverses exist, the desired formulas are

$$b = (X^\prime X)^{-1} X^\prime Y$$

for the $p+1$ parameter estimates $b$,

$$s^2 = (Y^\prime Y - b^\prime X^\prime Y)/(n-p-1)$$

for the residual variance (whose square root is the residual standard error), and

$$V = (X^\prime X)^{-1} s^2$$

for the variance-covariance matrix of $b$.
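These three formulas can be checked directly in R. A small sketch on simulated data (all values chosen arbitrarily for illustration; the code further below does the same thing from summary statistics alone):

set.seed(7)
n <- 20; p <- 2
X <- cbind(1, matrix(rnorm(n*p), ncol=p))    # Design matrix with constant column
Y <- as.vector(X %*% c(1, -1, 2) + rnorm(n)) # Illustrative model
b  <- solve(t(X) %*% X, t(X) %*% Y)          # (X'X)^{-1} X'Y
s2 <- (sum(Y^2) - t(b) %*% t(X) %*% Y) / (n-p-1)
V  <- solve(t(X) %*% X) * c(s2)              # Variance-covariance matrix of b
fit <- summary(lm(Y ~ X[, -1]))
rbind(From.formula=sqrt(diag(V)), From.lm=coef(fit)[, "Std. Error"])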

When only the summary statistics $\newcommand{\m}{\mathrm m} \m_X$ (the means of the $p$ non-constant columns of $X$, arranged as a column $p$-vector), $m_Y$ (the mean of $Y$), $\text{Cov}(X)$, $\text{Var}(Y)$, and $\text{Cov}(X,Y)$ are available, first recover the needed values via

$$(X^\prime X)_0 = (n-1)\text{Cov}(X) + n \m_X \m_X^\prime,$$

$$Y^\prime Y = (n-1)\text{Var}(Y) + n m_Y^2,$$

$$(X^\prime Y)_0 = (n-1)\text{Cov}(X,Y) + n m_Y \m_X.$$

Then to obtain $X^\prime X$, border $(X^\prime X)_0$ symmetrically by a vector of column sums (given by $n \m_X$) with the value $n$ on the diagonal; and to obtain $X^\prime Y$, augment the vector $(X^\prime Y)_0$ with the sum $n m_Y$. For instance, when using the first column for the constant term, these bordered matrices will look like

$$X^\prime X = \pmatrix{ n & n\m_X^\prime \\ n\m_X & (X^\prime X)_0 }$$

and

$$X^\prime Y = \left(n m_Y, (X^\prime Y)_0\right)^\prime$$

in block-matrix form.
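A quick numerical check of the first recovery formula, on arbitrary simulated columns (a sketch; `X0` stands in for the non-constant columns of $X$):

set.seed(7)
n  <- 20
X0 <- matrix(rnorm(n*2), ncol=2)    # Non-constant columns of X
mX <- colMeans(X0)
# (X'X)_0 from summaries vs. from the raw columns; difference is ~ 1e-14:
max(abs((n-1)*cov(X0) + n*outer(mX, mX) - crossprod(X0)))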

If the means are not available--the question did not indicate they are--then replace them all with zeros. The output will of course estimate an "intercept" of $0$, and its standard error will generally be incorrect, but the remaining coefficient estimates and standard errors will be correct. (A numerical check of this claim appears after the code below.)


Code

The following R code generates data, uses the preceding formulas to compute $b$, $s^2$, and the diagonal of $V$ from only the means and covariances of the data (along with the values of $n$ and $p$ of course), and compares them to standard least-squares output derived from the data. In all examples I have run (including multiple regression with $p\gt 1$) agreement is exact to the default output precision (about seven decimal places).

For simplicity--to avoid doing essentially the same set of operations three times--this code first combines all the summary data into a single matrix `v` and then extracts $X^\prime X$, $X^\prime Y$, and $Y^\prime Y$ from its entries. The comments note what is happening at each step.

n <- 24
p <- 3
beta <- seq(-p, p, length.out=p)# True coefficient vector (the model)
set.seed(17)
x <- matrix(rnorm(n*p), ncol=p) # Independent variables
y <- x %*% beta + rnorm(n)      # Dependent variable plus error
#
# Compute the first and second order data summaries.
#
m <- rep(0, p+1)                # Default means
m <- colMeans(cbind(x,y))       # If means are available--comment out otherwise
v <- cov(cbind(x,y))            # All variances and covariances
# 
# From this point on, only the summaries `m` and `v` are used for the calculations
# (along with `n` and `p`, of course).
#
m <- m * n                      # Compute column sums
v <- v * (n-1)                  # Recover centered sums of squares and products
v <- v + outer(m, m)/n          # Add mean terms to get raw sums of squares
v <- rbind(c(n, m), cbind(m, v))# Border with the sums and the data count
xx <- v[-(p+2), -(p+2)]         # Extract X'X
xy <- v[-(p+2), p+2]            # Extract X'Y
yy <- v[p+2, p+2]               # Extract Y'Y
b <- solve(xx, xy)              # Compute the coefficient estimates
s2 <- (yy - b %*% xy) / (n-p-1) # Compute the residual variance estimate
#
# Compare to `lm`.
#
fit <- summary(lm(y ~ x))
(rbind(Correct=coef(fit)[, "Estimate"], From.summary=b))    # Coeff. estimates
(c(Correct=fit$sigma, From.summary=sqrt(s2)))               # Residual SE
#
# The SE of the intercept will be incorrect unless true means are provided.
#
se <- sqrt(diag(solve(xx) * c(s2))) # Remove `diag` to compute the full var-covar matrix
(rbind(Correct=coef(fit)[, "Std. Error"], From.summary=se)) # Coeff. SEs
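To verify the earlier claim about unavailable means, the same computation can be rerun with the means zeroed out. A sketch, reusing `x`, `y`, `n`, `p`, and `fit` from the listing above:

m0 <- rep(0, p+1)                   # Means treated as unknown
v0 <- cov(cbind(x, y)) * (n-1)      # Centered sums of squares, as before
v0 <- v0 + outer(m0, m0)/n          # No-op when the means are zero
v0 <- rbind(c(n, m0), cbind(m0, v0))# Border as before
xx0 <- v0[-(p+2), -(p+2)]           # X'X (with zero means)
xy0 <- v0[-(p+2), p+2]              # X'Y
b0  <- solve(xx0, xy0)              # Intercept estimated as 0
s20 <- (v0[p+2, p+2] - b0 %*% xy0) / (n-p-1)
se0 <- sqrt(diag(solve(xx0) * c(s20)))
# The slope SEs agree with lm; only the intercept row differs.
rbind(Correct=coef(fit)[, "Std. Error"], Zero.means=se0)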