Solved – Understanding the formula of dfbetas


I'm referring to the formula used in the answer here.

The numerator in the formula for dfbetas is straightforward: the difference between the value of the coefficient in a regression model fitted without a particular observation and its value in the model fitted with it.

I'm having a hard time understanding the denominator.

Is it scaled by the standard error from the full model, or by the standard error of that particular coefficient? Or something else?

Best Answer

The denominator is the estimated standard error of the coefficient in question (here, the slope $\hat \beta$ of the regression line), as stated in the equation:

$\mathrm{dfbetas}_{k(-i)} = \Large {\hat{b}_k - \hat{b}_{k(-i)}\over \mathrm{SE}\, \hat{b}_{k(-i)}}\tag1$

where $\hat{b}_k$ is the $k$-th regression coefficient, and $\hat{b}_{k(-i)}$ stands for the $k$-th regression coefficient after deleting the $i$-th observation from the data.

Now, credit where credit is due: to work out the denominator I leaned on the toy data and the answer in the question you link to, as well as on the beautiful explanation of standard errors of coefficients by @ocram. To make things easy, here is the original setup:

    x <- c(0.512, 0.166, -0.142, -0.614, 12.72)
    y <- c(0.545, -0.02, -0.137, -0.751, 1.344)

With two regression models, one including all the data points, and a second one excluding the last data point:

    fit1 <- lm(y ~ x)
    fit2 <- lm(y[-5] ~ x[-5])
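With both fits in hand, the numerator of equation $(1)$ can already be read off directly (a quick check, not part of the original answer):

    # Numerator of equation (1): the shift in the slope estimate
    coef(fit1)[2]                   # slope with all five points
    coef(fit2)[2]                   # slope without the fifth point
    coef(fit1)[2] - coef(fit2)[2]   # the numerator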

So we want to calculate the standard error of the estimated slope; in other words, of $\hat{b}_k = \hat{b}_2$, since the first coefficient is the intercept.

In general,

$\mathrm{SE}\, \hat{b}_{k(-i)} = \sqrt{\mathrm{MSE_{(-i)}}\cdot \mathrm{c_{kk}}}\tag2$

with $\mathrm{MSE_{(-i)}}$ standing for the mean squared error of the regression fitted with the $i$-th data point deleted, and $\mathrm{c_{kk}}$ corresponding to the $k$-th diagonal element of the unscaled covariance matrix, in our case $\mathrm{c_{22}}$.
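This already answers the question directly: the scaling is neither the full model's reported coefficient standard error nor the reduced model's, but a hybrid of the two, with $\hat\sigma^2$ taken from the reduced fit and $\mathrm{c_{kk}}$ from the full fit (more on this at the end). For contrast, the standard error that summary() itself reports for the reduced model uses the reduced data's own $(X'X)^{-1}$:

    # For contrast only: summary() scales by the reduced model's own
    # (X'X)^-1, so this is NOT the dfbetas denominator
    summary(fit2)$coefficients[2, "Std. Error"]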

The $\mathrm{MSE_{(-i)}}$ corresponds to the estimate of the variance of the residuals in the linear model $\epsilon\sim N(0,\sigma^2I)$, i.e. $\hat\sigma^2$. In this case the model is the OLS fit without the fifth data point.

To calculate $\mathrm{MSE_{(-i)}}$:

$\mathrm{MSE_{(-i)}} = \hat\sigma_{(-i)}^2 =\frac{\displaystyle\sum_{j=1}^{4}(y_j\,-\,\hat y_j)^2}{df}$. So it is the sum of the squared differences between the actual $y$ values and the fitted values, divided by the degrees of freedom, which turn out to be $df = 4 - 2 = 2$: we have $4$ data points left after removing the fifth one, and we lose two degrees of freedom for the two estimated coefficients (intercept and slope).

In R we can calculate this as follows:

    # First manually:
    ANOVA_Sum_Sq <- sum((y[-5] - fitted(fit2))^2)
    df <- length(y[-5]) - 2
    (ANOVA_Mean_Sq <- ANOVA_Sum_Sq/df)
    [1] 0.01411015

    # ... and now with built-in functions:
    summary(fit2)$sigma^2
    # ... or ...
    anova(fit2)[[3]][2]
    # ... or simply checking the Mean Sq of the residuals in anova(fit2):
    anova(fit2)
    Analysis of Variance Table

    Response: y[-5]
              Df  Sum Sq Mean Sq F value Pr(>F)  
    x[-5]      1 0.81903 0.81903  58.046 0.0168 *
    Residuals  2 0.02822 0.01411 

Now that we have this part (corresponding to $\mathrm{MSE_{(-i)}}$ in equation $[2]$), we need to look into the $(X'X)^{-1}$ matrix for the original data to calculate the second part, $\mathrm{c_{kk}}$. In the case of a single explanatory variable or regressor, the equation is nicely derived in the linked posts, but I'll paste it here to keep things simple:

$(\mathbf{X}^{\prime} \mathbf{X})^{-1} = \frac{1}{n\sum x_i^2 - (\sum x_i)^2} \left( \begin{array}{cc} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{array} \right)$

The scaling factor is the reciprocal of the determinant, $n\sum x_i^2 - (\sum x_i)^2$:

    (w <- 1/((length(x) * (sum((x)^2))) - ((sum(x))^2)))
    [1] 0.0004025765

With $n = 5$ (the number of original data points), the bottom-right entry gives:

    (c_22 <- w * length(x))
    [1] 0.007661589
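As a sanity check (my addition, not in the original answer), the same number can be read off the unscaled covariance matrix that R stores for the full model:

    # (X'X)^-1 of the full model, as kept by R:
    summary(fit1)$cov.unscaled
    # Its [2, 2] entry matches c_22 computed by hand above:
    summary(fit1)$cov.unscaled[2, 2]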

So now we can calculate the denominator as:

$\mathrm{SE}\, \hat{b}_{k(-i)} = \sqrt{\mathrm{MSE_{(-i)}}\cdot \mathrm{c_{22}}}=\sqrt{0.01411015 \times 0.007661589} = 0.01039741$.
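In R, reusing ANOVA_Mean_Sq and c_22 from above, this is a one-liner:

    # Denominator of equation (1): MSE from the reduced model,
    # scaled by c_22 from the full model's (X'X)^-1
    (SE_b2 <- sqrt(ANOVA_Mean_Sq * c_22))
    [1] 0.01039741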

Done. To prove that this is correct we can quickly compute equation $[1]$ by hand:

    (fit1$coef[2] - fit2$coef[2])/0.01039741
            x 
    -93.81418 

This matches the dfbetas entry for the slope when the fifth data point is deleted:

    (dfbetas(fit1))
      (Intercept)            x
    1  1.06199661  -0.39123009
    2  0.06925319  -0.02907481
    3 -0.02165967   0.01003539
    4 -1.24491242   0.65495527
    5 -0.54223793 -93.81415653

So we have preserved the original model matrix $X$ and the unscaled covariance matrix $(X'X)^{-1}$ of the original data, but we have changed $\hat\sigma^2$ to reflect the exclusion of a data point.
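As a final illustration (my own generalization of the recipe above, not part of the original answer), the same computation reproduces the whole slope column of dfbetas(fit1) with a leave-one-out loop:

    # Reproduce the slope column of dfbetas(fit1) one deletion at a time:
    XtX_inv_22 <- summary(fit1)$cov.unscaled[2, 2]  # c_22 of the full model
    manual_dfbetas <- sapply(seq_along(x), function(i) {
      fit_i <- lm(y[-i] ~ x[-i])             # refit without observation i
      mse_i <- summary(fit_i)$sigma^2        # MSE of the reduced model
      (coef(fit1)[2] - coef(fit_i)[2]) / sqrt(mse_i * XtX_inv_22)
    })
    cbind(manual_dfbetas, dfbetas(fit1)[, 2])  # the two columns agree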
