Regression Confidence Interval – How to Calculate the Confidence Interval of a Regression Prediction for a Binary Predictor

Tags: confidence-interval, regression, self-study

I have a regression predicting wage (dependent variable) from whether the participant is a college graduate (dummy variable, independent).

I have the regression coefficients and their standard errors, the R-squared of the regression, and the covariance between the intercept and the coefficient on the dummy variable.

How do I calculate a confidence interval for the average wage of a college graduate (given that dummy = 1)?

Best Answer

Understanding what's going on comes down to appreciating the distinctions between parameters, random variables, and realizations of random variables. Getting an answer comes down to using this understanding to identify the pieces of information that are needed and knowing enough about the computer output to find those pieces.


Your model is in the form

$$\text{wage} = \beta_0 + \beta_1 \text{[college graduate]}.$$

This assumes $\text{[college graduate]}$ is coded as $1$ for grads and $0$ for non-grads.

The first step in solving such problems is to figure out what combination of coefficients corresponds to what you're estimating. Because that's the easy part, and it will be hard to go on without answering it, I will point out that the "average wage of a college graduate" is obtained by plugging in the dummy value for college grads, giving $\beta_0 + \beta_1 \times 1 = \beta_0 + \beta_1$.
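To make this concrete, here is a minimal sketch of the whole setup. The data are simulated and the use of statsmodels is just one convenient choice; none of the numbers or variable names come from the original question:

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: grad is the dummy (1 = college graduate, 0 = not)
rng = np.random.default_rng(0)
grad = rng.integers(0, 2, size=200)
wage = 30 + 12 * grad + rng.normal(0, 5, size=200)

# Fit wage = b0 + b1 * grad
fit = sm.OLS(wage, sm.add_constant(grad)).fit()
b0, b1 = fit.params

# The estimated average wage of a graduate: plug in dummy = 1
print(b0 + b1)             # identical to fit.predict([[1, 1]])
```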

The second step is recognizing that you do not know either $\beta_0$ or $\beta_1$. You estimate them from the data. The regression results include their estimates, which we may call $b_0$ and $b_1$, respectively. Because these estimates would change if the data were collected anew, and are therefore likely to differ from the true betas, we model the wages as random variables. Because both $b_0$ and $b_1$ are computed from these data, they are realizations of random variables, too. Let's call these random variables $B_0$ and $B_1$. Got that?

OK, if not, let's note that if you have centered your x-values in the regression (subtracted their mean, to make the formulas simpler), then $b_0$ is simply the average of all the $y$-values:

$$b_0 = \frac{1}{n}\sum_i y_i.$$

We are viewing the $y_i$ as realizations of random variables $Y_i$ (the wages for each subject in the dataset). Thus, $b_0$ is a realization of the random variable

$$B_0 = \frac{1}{n}\sum_i Y_i.$$

When you make assumptions about the distributions of the $Y_i$, this formula lets you determine how those assumptions affect the distribution of $B_0$. You would usually be most interested in the expected value and the variance of $B_0$: the expectation tells you what $b_0$ ought to be, more or less (for an unbiased procedure it equals $\beta_0$), and the variance (once you take its square root) tells you how close $b_0$ ought to come to $\beta_0$.
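You can check the centered-intercept claim numerically. The following sketch uses simulated data; the mean-centering step is the assumption that makes the intercept equal the average of the $y$-values:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data (illustrative values only)
rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=100).astype(float)
y = 30 + 12 * x + rng.normal(0, 5, size=100)

# Center the regressor so it has mean zero
x_centered = x - x.mean()
fit = sm.OLS(y, sm.add_constant(x_centered)).fit()

# With a mean-zero regressor, the fitted intercept is exactly y-bar
print(fit.params[0], y.mean())
```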

Similar reasoning (but with more complicated formulas) holds for $b_1$.

(Fortunately, you will not need to work through all the mathematics to relate the expectations and variances of the $B_i$ to those of the $Y_i$: the regression procedure does that for you.)

The upshot is that the fitted regression coefficients $b_0$ and $b_1$ are realizations of random variables. You are interested in how much $b_0 + b_1$ might vary, because (obviously) this is your estimate of $\beta_0 + \beta_1$. To that end you would want to find out two things (the third step):

  1. What is the expected value of $B_0 + B_1$? (Hint: regression theory tells you what this is in terms of $\beta_0$ and $\beta_1$.)

  2. How much should $B_0 + B_1$ vary around its expectation?

To answer #2, you would like to find the variance of the random variable $B_0+B_1$ (and then take its square root). You know, from basic principles governing covariances, that the variance of $B_0+B_1$ can be computed from the variance-covariance matrix of $(B_0, B_1)$, and you should know how to do this.
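For reference, these are standard facts rather than anything special to this problem: linearity of expectation together with the unbiasedness of least squares answers #1, and the usual expansion of the variance of a sum answers #2:

$$E[B_0 + B_1] = \beta_0 + \beta_1, \qquad \operatorname{Var}(B_0 + B_1) = \operatorname{Var}(B_0) + \operatorname{Var}(B_1) + 2\operatorname{Cov}(B_0, B_1).$$

The square root of that variance is the standard error of your estimate $b_0 + b_1$.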

This reasoning reduces the problem to:

Using the regression output, how can you reconstruct the variance-covariance matrix of the coefficients?
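As a concrete (entirely hypothetical) illustration of how the pieces combine: the reported standard errors give the diagonal of that matrix, the reported covariance gives the off-diagonal entry, and a $t$ critical value finishes the interval. All numbers below are made up:

```python
import numpy as np
from scipy import stats

# Hypothetical regression output (every number is invented for illustration)
b0, b1 = 30.2, 11.8        # coefficient estimates
se0, se1 = 0.6, 0.9        # their standard errors
cov01 = -0.36              # covariance of the two coefficient estimates
n, k = 200, 2              # sample size and number of coefficients

# Rebuild the variance-covariance matrix of (B0, B1)
V = np.array([[se0**2, cov01],
              [cov01, se1**2]])

# Var(B0 + B1) = c' V c with c = (1, 1); its square root is the SE
c = np.array([1.0, 1.0])
se_sum = np.sqrt(c @ V @ c)

# 95% confidence interval for beta_0 + beta_1, using t with n - k df
t_crit = stats.t.ppf(0.975, df=n - k)
est = b0 + b1
print(est - t_crit * se_sum, est + t_crit * se_sum)
```

If you are working in statsmodels rather than from printed output, `fit.cov_params()` returns this variance-covariance matrix directly, so you rarely need to rebuild it by hand.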
