R Regression – How to Manually Calculate DFBETAS for Diagnostic Purposes

Tags: diagnostic, r, regression, regression coefficients

I am trying to replicate what the function dfbetas() does in R.

dfbeta() is not an issue… Here is a set of vectors:

x <- c(0.512, 0.166, -0.142, -0.614, 12.72)
y <- c(0.545, -0.02, -0.137, -0.751, 1.344)

If I fit two regression models as follows:

fit1 <- lm(y ~ x)
fit2 <- lm(y[-5] ~ x[-5])

I see that eliminating the last point results in a very different slope (steeper, drawn as a blue line in the original plot).

This is reflected in the change in slopes:

fit1$coeff[2] - fit2$coeff[2]
-0.9754245

which coincides with the dfbeta(fit1) for the fifth value:

   (Intercept)            x
1  0.182291949 -0.011780253
2  0.020129324 -0.001482465
3 -0.006317008  0.000513419
4 -0.207849024  0.019182219
5 -0.032139356 -0.975424544
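
The same check can be repeated for every observation rather than just the fifth. The following is a minimal sketch, assuming only the toy vectors from the question; the names `manual_dfbeta` and `fit_i` are illustrative and not part of the original post. It refits the model once per deleted point and compares the coefficient changes with `dfbeta()`.

```r
# Toy data from the question
x <- c(0.512, 0.166, -0.142, -0.614, 12.72)
y <- c(0.545, -0.02, -0.137, -0.751, 1.344)
fit1 <- lm(y ~ x)

# Refit without observation i and record the change in each coefficient
manual_dfbeta <- t(sapply(seq_along(x), function(i) {
  fit_i <- lm(y[-i] ~ x[-i])
  unname(coef(fit1) - coef(fit_i))
}))
colnames(manual_dfbeta) <- names(coef(fit1))

# Should agree with the built-in values up to floating-point error
all.equal(manual_dfbeta, dfbeta(fit1), check.attributes = FALSE)
```

Each row of `manual_dfbeta` is the full-data estimate minus the leave-one-out estimate, which is the quantity that dfbetas then standardizes.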

Now, to standardize this change in slope (that is, to obtain dfbetas), I turn to:

Williams, D. A. (1987) Generalized linear model diagnostics using the
deviance and single case deletions. Applied Statistics 36, 181–191

which I think may be one of the references in the R documentation under the package {stats}. There the formula for dfbetas is:

$\large \mathrm{dfbetas} (i, \mathrm{fit}) = \Large {(\hat{b} - \hat{b}_{-i})\over \mathrm{SE}\, \hat{b}_{-i}}$

This could be easily calculated in R:

(fit1$coef[2] - fit2$coef[2])/summary(fit2)$coef[4]

yielding: -6.79799

The question is why I am not getting the fifth value for the slope in:

dfbetas(fit1)

  (Intercept)            x
1  1.06199661  -0.39123009
2  0.06925319  -0.02907481
3 -0.02165967   0.01003539
4 -1.24491242   0.65495527
5 -0.54223793 -93.81415653!

What is the right equation to go from dfbeta to dfbetas?

Best Answer

$DFBETAS_{k(i)}$ is calculated by:

$b_k-b_{k(i)}\over{\sqrt{MSE_{(i)}c_{kk}}}$, for $k$ = 1, 2, . . . , $p$.

where $b_k$ is the $k$th regression coefficient estimated from all the data and $b_{k(i)}$ is the same coefficient with the $i$th case deleted. $MSE_{(i)}$ is the mean-square error from the regression with the $i$th case deleted, and $c_{kk}$ is the $k$th diagonal element of the unscaled covariance matrix $(X^{\prime}X)^{-1}$ computed from the full data.

So you can calculate $DFBETAS_{k(i)}$ manually with the following R code:

numerator<-(fit1$coef[2] - fit2$coef[2])
denominator<-sqrt((summary(fit2)$sigma^2)*diag(summary(fit1)$cov.unscaled)[2])
DFBETAS<-numerator/denominator
DFBETAS
        x 
-93.81416
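
The calculation above handles one observation and one coefficient. As a hedged sketch of how it might be extended to every row and coefficient without refitting, the leave-one-out residual standard deviations can be taken from `lm.influence()`, which returns them in its `sigma` component. The variable names `sigma_loo`, `c_kk`, and `manual_dfbetas` are assumptions made for illustration.

```r
# Toy data from the question
x <- c(0.512, 0.166, -0.142, -0.614, 12.72)
y <- c(0.545, -0.02, -0.137, -0.751, 1.344)
fit1 <- lm(y ~ x)

# Leave-one-out residual SDs: element i is sqrt(MSE_(i))
sigma_loo <- lm.influence(fit1)$sigma

# Diagonal of the unscaled covariance matrix (X'X)^{-1} from the full data
c_kk <- diag(summary(fit1)$cov.unscaled)

# Divide each dfbeta entry by sqrt(MSE_(i) * c_kk)
manual_dfbetas <- dfbeta(fit1) / outer(sigma_loo, sqrt(c_kk))

all.equal(manual_dfbetas, dfbetas(fit1), check.attributes = FALSE)
```

The `outer()` call builds an n-by-p matrix of denominators, one per observation and coefficient, so the division is element-wise.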

Related Solutions

Solved – Understanding the formula of dfbetas

The denominator corresponds to the calculated standard error for the slope, $\hat \beta$, of the regression line, as stated in the equation:

$\mathrm{dfbetas} ((-i),\mathrm{fit}) = \Large {\hat{b}_k - \hat{b}_{k(-i)}\over \mathrm{SE}\, \hat{b}_{(-i)}}\tag1$

where $\hat{b}_k$ is the $k$-th regression coefficient, and $\hat{b}_{k(-i)}$ stands for the $k$-th regression coefficient after removing the $i$-th data point.

Credit where credit is due: I have broken down the calculation of the denominator for the toy data in the question you link to, drawing on the answer to that question as well as the clear explanation of standard errors of coefficients by @ocram. To keep things simple, here is the original setup:

x <- c(0.512, 0.166, -0.142, -0.614, 12.72)
y <- c(0.545, -0.02, -0.137, -0.751, 1.344)

With two regression models, one including all the data points and a second one excluding the last data point:

fit1 <- lm(y ~ x)
fit2 <- lm(y[-5] ~ x[-5])

So we want to calculate the standard error of the estimated slope, in other words for $\hat{b}_k = \hat{b}_2$, since the first coefficient is the intercept.

In general,

$\mathrm{SE}\, \hat{b}_{-i} = \sqrt{\mathrm{MSE_{(-i)}}\cdot \mathrm{c_{kk}}}\tag2$

with $\mathrm{MSE_{(-i)}}$ standing for the mean-square error of the regression with the $i$-th point in the data cloud erased, and $\mathrm{c_{kk}}$ corresponding to the $k$-th element of the diagonal of the unscaled covariance matrix, in our case, $\mathrm{c_{22}}$.

The $\mathrm{MSE_{(-i)}}$ corresponds to the estimate of the variance of the errors in the linear model, $\epsilon\sim N(0,\sigma^2I)$, that is, $\hat\sigma^2$. In this case the model is the OLS fit without the fifth data point.

To calculate $\mathrm{MSE_{(-i)}}$:

$\mathrm{MSE_{(-i)}} = \hat\sigma_{(-i)}^2 =\frac{\displaystyle\sum_{j=1}^{4}(y_j\,-\,\hat y_j)^2}{df}$. So it is the sum of the squared differences between the actual $y$ values and the fitted values, divided by the degrees of freedom. Here $df = 2$: we have $4$ data points left after removing the fifth one, and we lose two degrees of freedom for the two estimated coefficients (intercept and slope).

In R we can calculate this as follows:

# First manually:
ANOVA_Sum_Sq <- sum((y[-5] - fitted(fit2))^2)
df <- length(y[-5]) - 2
(ANOVA_Mean_Sq <- ANOVA_Sum_Sq/df)
[1] 0.01411015

# ... and now with built-in formulas:
summary(fit2)$sigma^2
# ... or ...
anova(fit2)[[3]][2]
# ... or simply checking the Mean Sq of the residuals of anova(fit2):
anova(fit2)
Analysis of Variance Table

Response: y[-5]
          Df  Sum Sq Mean Sq F value Pr(>F)  
x[-5]      1 0.81903 0.81903  58.046 0.0168 *
Residuals  2 0.02822 0.01411

Now that we have this part (corresponding to $\mathrm{MSE_{(-i)}}$ in equation [2]), we need to look into the $(X'X)^{-1}$ matrix for the original data to calculate the second part: $\mathrm{c_{kk}}$. In the case of a single explanatory variable or regressor, the equation is provided in the linked posts, but I'll paste it here to make it simple:

$(\mathbf{X}^{\prime} \mathbf{X})^{-1} = \frac{1}{n\sum x_i^2 - (\sum x_i)^2} \left( \begin{array}{cc} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{array} \right)$

The scalar factor in front of the matrix is the reciprocal of the determinant of $X'X$:

(w <- 1/((length(x) * sum(x^2)) - sum(x)^2))
[1] 0.001532318

With n = 5 (the number of original data points), the second diagonal element is:

(c_22 <- w * length(x))
[1] 0.007661589

So now we can calculate the denominator as:

$\mathrm{SE}\, \hat{b}_{-i} = \sqrt{\mathrm{MSE_{(-i)}}\cdot \mathrm{c_{22}}}=\sqrt{0.01411015*0.007661589} = 0.01039741$.
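
The closed-form expression for $(X'X)^{-1}$ only applies to a single regressor. A minimal sketch of the same denominator using the design matrix directly, which would also carry over to several regressors, might look like this. The names `XtX_inv`, `mse_loo`, and `se_loo` are illustrative assumptions, not taken from the original posts.

```r
# Toy data and the two fits from the question
x <- c(0.512, 0.166, -0.142, -0.614, 12.72)
y <- c(0.545, -0.02, -0.137, -0.751, 1.344)
fit1 <- lm(y ~ x)
fit2 <- lm(y[-5] ~ x[-5])

X <- model.matrix(fit1)          # full-data design matrix: intercept and x
XtX_inv <- solve(crossprod(X))   # (X'X)^{-1}
c_22 <- XtX_inv[2, 2]            # diagonal element for the slope

mse_loo <- summary(fit2)$sigma^2 # MSE with the fifth point removed
se_loo <- sqrt(mse_loo * c_22)   # denominator of equation [2]
se_loo
```

`XtX_inv` should coincide with `summary(fit1)$cov.unscaled`, which is what the answer to the linked question uses.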

Done. To check that this is correct, we can compute equation [1] manually:

(fit1$coef[2] - fit2$coef[2])/0.01039741
        x 
-93.81418

This matches the dfbetas value for the slope when the fifth data point is excluded (the small difference in the last digits comes from rounding the denominator):

(dfbetas(fit1))
  (Intercept)            x
1  1.06199661  -0.39123009
2  0.06925319  -0.02907481
3 -0.02165967   0.01003539
4 -1.24491242   0.65495527
5 -0.54223793 -93.81415653

So we have preserved the original model matrix ($X$) and the unscaled covariance matrix ($(X'X)^{-1}$) of the original data, but we have changed $\hat\sigma^2$ to reflect the exclusion of a data point.
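
This also suggests why dividing by the slope's standard error taken from the reduced fit gives a different number: that standard error is built from the design matrix without the fifth row. A short sketch contrasting the two denominators, with illustrative names (`se_reduced`, `se_dfbetas`) that are not part of the original posts:

```r
# Toy data and the two fits from the question
x <- c(0.512, 0.166, -0.142, -0.614, 12.72)
y <- c(0.545, -0.02, -0.137, -0.751, 1.344)
fit1 <- lm(y ~ x)
fit2 <- lm(y[-5] ~ x[-5])

num <- unname(coef(fit1)[2] - coef(fit2)[2])

# Slope SE reported by the reduced fit: leave-one-out sigma AND reduced X
se_reduced <- summary(fit2)$coefficients[2, "Std. Error"]

# Denominator used by dfbetas: leave-one-out sigma, but full-data (X'X)^{-1}
se_dfbetas <- summary(fit2)$sigma * sqrt(summary(fit1)$cov.unscaled[2, 2])

num / se_reduced   # the value obtained in the original question
num / se_dfbetas   # the value reported by dfbetas(fit1)[5, 2]
```

Because the fifth point has a very large `x`, keeping it in $X$ makes $c_{22}$ much smaller, which is why the standardized change is so much larger in magnitude.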

R Survival – How to Manually Calculate `survfit` in Cox Hazard Model

Model

library(survival)
fit <- coxph(Surv(time, status) ~ age, data = kidney)

Goal

Obtain an estimate of the survival function over time at 60 years old, $\hat{S}(\cdot \,|\, \text{age} = 60)$.

Solution #1

summary(survfit(fit, newdata = data.frame(age = 60)))

Solution #2

As explained in this post, basehaz(fit, centered = FALSE) returns a non-parametric estimate of the cumulative baseline hazard, $\hat{H}_0(t)$:

H0 <- basehaz(fit, centered = FALSE)

Thus, an estimate of the cumulative hazard at 60 years old can be obtained by $\hat{H}(t) = \hat{H}_0(t) \exp(60 \, \beta_{\text{age}})$:

H <- H0$hazard * exp(60 * fit$coefficients)

Finally, an estimate of the survival function is $\hat{S}(t) = \exp(-\hat{H}(t))$:

S <- exp(- H) 

> data.frame(time = H0$time, Surv = S)
   time        Surv
1     2 0.985953977
2     4 0.985953977
3     5 0.985953977
4     6 0.985953977
5     7 0.956257141

Remark

Note that the second method returns the estimate of the survival at every time point found in the data set while the first method returns the estimate of the survival only at the event times.
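
As a rough check that the two solutions agree, the survival curve from `survfit()` can be compared with the one rebuilt from `basehaz()`. This is a sketch under the assumption that both functions report the same time grid for this fit, and `sf` is an illustrative name.

```r
library(survival)

fit <- coxph(Surv(time, status) ~ age, data = kidney)

# Solution #1: survival curve for a 60-year-old
sf <- survfit(fit, newdata = data.frame(age = 60))

# Solution #2: uncentered baseline cumulative hazard, scaled to age 60
H0 <- basehaz(fit, centered = FALSE)
S <- exp(-H0$hazard * exp(60 * coef(fit)))

# The two curves should match up to numerical error
all.equal(as.numeric(sf$surv), as.numeric(S))

# summary() keeps only the event times, so it has no more rows than H0
c(summary_rows = length(summary(sf)$time), basehaz_rows = length(H0$time))
```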
