After searching here:
Perform simple regression without raw data
I am still curious about this.
Is it possible to derive the standard error of a regression coefficient from summary data alone?
E.g., assume we are given the following variance-covariance matrix.
$\begin{bmatrix}Var(X) & Cov(X,Y)\\Cov(X,Y) & Var(Y)\end{bmatrix}$
We can derive the regression coefficient $\beta_{XY} = Cov(X,Y) / Var(X)$.
Given a specific n, is it possible to derive the standard error of $\beta$ as well? If so, which formula is being used? It appears that all formulas for regression standard errors that I could find assume that you know the variance of residuals of the regression, which we don't know from summary data alone.
--UPDATE--
Based on feedback by Greg and whuber, I am now at a point where I know that
$\sigma^2(b) = \sigma^2(X'X)^{-1}=\sigma^2(Var(X)(n-1))^{-1}$,
but typically matrix formulas for the standard error of $b$ then note that $\sigma^2$, the unknown variance of the errors, is estimated with MSE.
Based on comments by whuber, the residual sum of squares is
$Y'Y - bX'Y$, which according to Greg can be spelled out as
$Var(Y)(n-1) - b\,Cov(X,Y)(n-1)$
And to go from here to MSE, I think (?) all that is left is to divide by $n-2$,
so
$(Var(Y)(n-1) - b\,Cov(X,Y)(n-1)) / (n-2)$
And plugging this all in yields:
$\sigma^2(b) =((Var(Y)(n-1) - b\,Cov(X,Y)(n-1)) / (n-2))\times(Var(X)(n-1))^{-1}$
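One way to sanity-check this derivation is a short simulation working only from the summary statistics. The following is a sketch in Python with NumPy (the question works in R; the simulated data and all names are illustrative). Note that the residual sum of squares uses $Cov(X,Y)$, since $SSR = Y'Y - b\,X'Y$ in the centered single-predictor case:

```python
import numpy as np

# Simulated raw data (illustrative only): y = 2 + 3x + noise.
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

# Summary statistics only -- all the question assumes is available:
vx, vy = np.var(x, ddof=1), np.var(y, ddof=1)
cxy = np.cov(x, y)[0, 1]

b = cxy / vx                              # slope = Cov(X,Y) / Var(X)
ssr = vy * (n - 1) - b * cxy * (n - 1)    # SSR = Y'Y - b X'Y (centered)
mse = ssr / (n - 2)                       # estimate of sigma^2
se_b = np.sqrt(mse / (vx * (n - 1)))      # standard error of the slope

# Check against the raw-data least-squares fit:
b_hat, a_hat = np.polyfit(x, y, 1)        # slope, intercept
resid = y - (a_hat + b_hat * x)
se_direct = np.sqrt(resid @ resid / (n - 2) / ((x - x.mean()) ** 2).sum())
```

The slope and standard error computed from the summaries agree with the raw-data fit to floating-point precision.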
I will try to see if this works out in R, and will report back.
If anyone sees something blatantly wrong, I appreciate the heads up.
Best Answer
With the usual notation, arrange the independent variables by columns in an $n\times (p+1)$ array $X$, with one of them filled with the constant $1$, and arrange the dependent values $Y$ into an $n$-vector (also a column). Assuming the appropriate inverses exist, the desired formulas are
$$b = (X^\prime X)^{-1} X^\prime Y$$
for the $p+1$ parameter estimates $b$,
$$s^2 = (Y^\prime Y - b^\prime X^\prime Y)/(n-p-1)$$
for the residual standard error, and
$$V = (X^\prime X)^{-1} s^2$$
for the variance-covariance matrix of $b$.
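These three formulas translate directly into code. As a quick illustration, here is a sketch in Python with NumPy (the simulated data and names are my own, not part of the answer):

```python
import numpy as np

# Simulated raw data (illustrative): p = 2 predictors plus a constant.
rng = np.random.default_rng(1)
n, p = 50, 2
X0 = rng.normal(size=(n, p))
Y = 1.0 + X0 @ np.array([2.0, -1.0]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), X0])       # n x (p+1), constant first
b = np.linalg.solve(X.T @ X, X.T @ Y)       # b = (X'X)^{-1} X'Y
s2 = (Y @ Y - b @ (X.T @ Y)) / (n - p - 1)  # residual variance
V = s2 * np.linalg.inv(X.T @ X)             # variance-covariance of b
se = np.sqrt(np.diag(V))                    # standard errors of b
```

Here `se` holds the standard errors that standard regression software would report for the same data.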
When only the summary statistics $\newcommand{\m}{\mathrm m} \m_X$ (the means of the non-constant columns of $X$, forming a $p$-covector), $m_Y$ (the mean of $Y$), $\text{Cov}(X)$ (the $p\times p$ covariance matrix of those columns), $\text{Var}(Y)$, and $\text{Cov}(X,Y)$ are available, first recover the needed values via
$$(X^\prime X)_0 = (n-1)\text{Cov}(X) + n \m_X \m_X^\prime,$$
$$Y^\prime Y = (n-1)\text{Var}(Y) + n m_Y^2,$$
$$(X^\prime Y)_0 = (n-1)\text{Cov}(X,Y) + n m_Y \m_X.$$
Then to obtain $X^\prime X$, border $(X^\prime X)_0$ symmetrically by a vector of column sums (given by $n \m_X$) with the value $n$ on the diagonal; and to obtain $X^\prime Y$, augment the vector $(X^\prime Y)_0$ with the sum $n m_Y$. For instance, when using the first column for the constant term, these bordered matrices will look like
$$X^\prime X = \pmatrix{ n & n\m_X \\ n\m_X^\prime & (X^\prime X)_0 }$$
and
$$X^\prime Y = \left(n m_Y, (X^\prime Y)_0\right)^\prime$$
in block-matrix form.
If the means are not available--the question did not indicate they are--then replace them all with zeros. The output will estimate an "intercept" of $0$, of course, and its standard error will likely be incorrect, but the remaining coefficient estimates and standard errors will be correct.
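Putting the pieces together, the recovery-plus-bordering procedure can be sketched as follows (in Python with NumPy; this is not the answer's R listing, and the simulated data and variable names are illustrative):

```python
import numpy as np

# Simulated raw data (illustrative): p = 2 predictors plus a constant.
rng = np.random.default_rng(2)
n, p = 200, 2
X0 = rng.normal(size=(n, p))
Y = 1.0 + X0 @ np.array([2.0, -1.0]) + rng.normal(size=n)

# --- summary statistics only ---
mX = X0.mean(axis=0)                 # p-vector of column means
mY = Y.mean()
CX = np.cov(X0, rowvar=False)        # p x p covariance of the columns of X
vY = Y.var(ddof=1)
cXY = np.array([np.cov(X0[:, j], Y)[0, 1] for j in range(p)])

# --- recover the cross-product matrices ---
XtX0 = (n - 1) * CX + n * np.outer(mX, mX)
YtY = (n - 1) * vY + n * mY**2
XtY0 = (n - 1) * cXY + n * mY * mX

# --- border for the constant term (first row/column) ---
XtX = np.block([[np.array([[n]]), (n * mX)[None, :]],
                [(n * mX)[:, None], XtX0]])
XtY = np.concatenate([[n * mY], XtY0])

b = np.linalg.solve(XtX, XtY)            # coefficient estimates
s2 = (YtY - b @ XtY) / (n - p - 1)       # residual variance
V = s2 * np.linalg.inv(XtX)              # variance-covariance of b
se = np.sqrt(np.diag(V))                 # standard errors
```

The resulting $b$, $s^2$, and standard errors match an ordinary least-squares fit on the raw data, even though only means and covariances were used.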
Code
The following R code generates data, uses the preceding formulas to compute $b$, $s^2$, and the diagonal of $V$ from only the means and covariances of the data (along with the values of $n$ and $p$, of course), and compares them to standard least-squares output derived from the data. In all examples I have run (including multiple regression with $p\gt 1$), agreement is exact to the default output precision (about seven decimal places). For simplicity--to avoid doing essentially the same set of operations three times--this code first combines all the summary data into a single matrix v and then extracts $X^\prime X$, $X^\prime Y$, and $Y^\prime Y$ from its entries. The comments note what is happening at each step.