I am trying to replicate what the function dfbetas() does in R.

dfbeta() is not an issue… Here is a set of vectors:

x <- c(0.512, 0.166, -0.142, -0.614, 12.72)
y <- c(0.545, -0.02, -0.137, -0.751, 1.344)

If I fit two regression models as follows:

fit1 <- lm(y ~ x)
fit2 <- lm(y[-5] ~ x[-5])

I see that eliminating the last point results in a very different slope (blue line – steeper):

This is reflected in the change in slopes:

fit1$coeff[2] - fit2$coeff[2]

which coincides with the dfbeta(fit1) for the fifth value:

   (Intercept)            x
1  0.182291949 -0.011780253
2  0.020129324 -0.001482465
3 -0.006317008  0.000513419
4 -0.207849024  0.019182219
5 -0.032139356 -0.975424544
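
As a quick check of that claim, here is a minimal sketch that compares the refit-based differences with the fifth row of dfbeta(fit1). The name `delta` is introduced here only for illustration, and `unname()` is needed because the coefficient of fit2 is labelled `x[-5]` rather than `x`.

```r
x <- c(0.512, 0.166, -0.142, -0.614, 12.72)
y <- c(0.545, -0.02, -0.137, -0.751, 1.344)

fit1 <- lm(y ~ x)            # all five points
fit2 <- lm(y[-5] ~ x[-5])    # fifth point deleted

# Change in both coefficients (intercept, slope) when case 5 is dropped
delta <- unname(coef(fit1) - coef(fit2))

# Compare with the fifth row of dfbeta(), ignoring the differing names
all.equal(delta, unname(dfbeta(fit1)[5, ]))
```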

Now if I want to standardize this change in slope (obtain dfbetas) and I resort to:

which I think may be one of the references in the R documentation under the package {stats}. There the formula for dfbetas is:

$\large \mathrm{dfbetas} (i, \mathrm{fit}) = \Large {(\hat{b} - \hat{b}_{-i})\over \mathrm{SE}\, \hat{b}_{-i}}$

This could be easily calculated in R:

(fit1$coef[2] - fit2$coef[2]) / summary(fit2)$coef[2, 2]

yielding: -6.79799

The question is why I am not getting the fifth value for the slope in dfbetas(fit1):

  (Intercept)            x
1  1.06199661  -0.39123009
2  0.06925319  -0.02907481
3 -0.02165967   0.01003539
4 -1.24491242   0.65495527
5 -0.54223793 -93.81415653!

What is the right equation to go from dfbeta to dfbetas?

$DFBETAS_{k(i)}$ is calculated by:

$\dfrac{b_k-b_{k(i)}}{\sqrt{MSE_{(i)}\,c_{kk}}}$, for $k = 1, 2, \ldots, p$,

where $b_k$ is the $k$th regression coefficient estimated from all the data and $b_{k(i)}$ is the same coefficient with the $i$th case deleted. $MSE_{(i)}$ is the mean-square error from the regression in which the $i$th case is deleted, and $c_{kk}$ is the $k$th diagonal element of the unscaled covariance matrix $(X^{\prime}X)^{-1}$, where $X$ is the design matrix of the full data (all cases included).
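
To check this formula against R for every case and coefficient at once, here is a minimal sketch that refits the model with each case deleted in turn. The names `X`, `c_kk`, `manual`, `fit_i` and `s_i` are introduced here only for illustration; the data are the vectors from the question.

```r
x <- c(0.512, 0.166, -0.142, -0.614, 12.72)
y <- c(0.545, -0.02, -0.137, -0.751, 1.344)
fit1 <- lm(y ~ x)

# c_kk: diagonal of (X'X)^{-1}, built from the FULL design matrix
X    <- model.matrix(fit1)
c_kk <- diag(solve(crossprod(X)))

# One row per deleted case, one column per coefficient
manual <- t(sapply(seq_along(y), function(i) {
  fit_i <- lm(y[-i] ~ x[-i])        # regression with case i deleted
  s_i   <- summary(fit_i)$sigma     # sqrt(MSE_(i))
  unname(coef(fit1) - coef(fit_i)) / (s_i * sqrt(c_kk))
}))
dimnames(manual) <- dimnames(dfbetas(fit1))

# Compare the manual values with R's built-in result
all.equal(manual, dfbetas(fit1))
```

Note that `c_kk` comes from the full data, while the standard error reported by summary() for the reduced fit is based on the $(X^{\prime}X)^{-1}$ of the reduced data, so dividing by that standard error gives a different value.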

So you can calculate $DFBETAS_{k(i)}$ manually with the following R code:

numerator <- fit1$coef[2] - fit2$coef[2]