Confidence Intervals – Calculating Confidence Intervals Around Functions of Estimated Parameters

confidence intervalfittingnormal distributionstandard deviationstandard error

Suppose I estimate the following model,
$$
y = \hat{\beta_0} + \hat{\beta_1} X_1 + \hat{\beta_2} X_2\ ,
$$
and am additionally interested in the quantity
$$
\gamma = \sqrt{\beta_0^2 – 4\beta_1\beta_2}\ .
$$
An estimate of the above, $\hat{\gamma}$, is clearly given by inserting the estimated $\hat\beta_i$'s in the above formula.

But how does one go about computing the confidence interval around such a "compound" quantity that is not directly estimated?

My attempt.

My "empirical" approach is as follows. I generate a multivariate normal distribution of size $n$ using the estimated means $\hat\beta_i$'s and the covariance matrix of the fit. I consequently have a distribution of size $n$ for the $\gamma$. The mean of this distribution $\bar\gamma$ is naturally then close to the $\hat\gamma$ computed above, and I now additionally have a standard deviation of this distribution $\sigma$.

Using $\sigma$ to generate a confidence interval.

If I use the standard definition $\bar\gamma \pm t(\alpha, n-1) \sigma / \sqrt{n}$, where $t$ is the PPF of the $t$-distribution, I run into the following quandary: since I pick $n$ myself, I can in fact make the standard error as small as I like. This is not what I want: I am looking for a "natural" confidence interval which depends solely on the uncertainty in my estimates of the $\beta$'s. Therefore, my idea is the following: $\sigma$ stays positive, and seems to "converge" to its true value for large $n$: hence, 95% of the values in my $\gamma$ distribution lie within the interval $\bar\gamma \pm 2\sigma$.

Is this approach valid?

Best Answer

Two common approaches for this problem are to calculate the non-linear combination of the coefficients directly from the regression or to bootstrap it.

The variance in the former is based on the "delta method", an approximation appropriate in large samples. This was suggested in the other answer, but statistics software can make the calculation a whole lot easier.

The variance for the latter comes from resampling the data in memory with replacement, fitting the model, calculating the coefficient combination, and then using the sampled distribution to get the confidence interval.

Here's an example of both using Stata:

. sysuse auto, clear
(1978 automobile data)

. set seed 11082021

. regress price mpg foreign

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     14.07
       Model |   180261702         2  90130850.8   Prob > F        =    0.0000
    Residual |   454803695        71  6405685.84   R-squared       =    0.2838
-------------+----------------------------------   Adj R-squared   =    0.2637
       Total |   635065396        73  8699525.97   Root MSE        =    2530.9

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         mpg |  -294.1955   55.69172    -5.28   0.000    -405.2417   -183.1494
     foreign |   1767.292    700.158     2.52   0.014     371.2169    3163.368
       _cons |   11905.42   1158.634    10.28   0.000     9595.164    14215.67
------------------------------------------------------------------------------

. nlcom (gamma_dm:sqrt(_b[_cons] - 4*_b[mpg]*_b[foreign]))

    gamma_dm: sqrt(_b[_cons] - 4*_b[mpg]*_b[foreign])

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    gamma_dm |   1446.245   361.0078     4.01   0.000     738.6823    2153.807
------------------------------------------------------------------------------

The 95% CI using the delta method is [738.6823, 2153.807].

Boostrapping yields [740.5149, 2151.974], which is fairly similar:

. bootstrap (gamma_bs:sqrt(_b[_cons] - 4*_b[mpg]*_b[foreign])), reps(500) nodots: regress price mpg foreign

Linear regression                                          Number of obs =  74
                                                           Replications  = 499

        Command: regress price mpg foreign
[gamma_bs]_bs_1: sqrt(_b[_cons] - 4*_b[mpg]*_b[foreign])

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
gamma_bs     |
       _bs_1 |   1446.245   360.0728     4.02   0.000     740.5149    2151.974
------------------------------------------------------------------------------
Note: One or more parameters could not be estimated in 1 bootstrap replicate;
      standard-error estimates include only complete replications.

Your Solution

Your proposed solution would work if you have lots of data, but here it does not do so well with only 74 observations:

. quietly regress price mpg foreign

. corr2data b_mpg b_foreign b_cons, n(500) means(e(b)) cov(e(V)) clear
(obs 500)

. gen gamma_sim = sqrt(b_cons - 4*b_mpg*b_foreign)
(3 missing values generated)

. sum gamma_sim

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
   gamma_sim |        497    1426.183    366.6408   197.5594   2397.263

. display "[" %-9.4f r(mean) + invttail(r(N)-1,.975)*r(sd) ", " r(mean) + invttail(r(N)-1,.025)*r(sd) "]"
[705.8224 , 2146.5434]

The CI here is [705.8224 , 2146.5434], which is noticeably different from the two CIs above.

My thought is that if you are going to simulate, you might as well bootstrap and not rely on the normal approximation that is only valid asymptotically. If you have lots of data, the difference between bootstrapping and sampling from MVN parameterized by estimated coefficients and variance should not be noticeable.

Related Solutions

Solved – Does more variables mean tighter confidence intervals

do I ALWAYS get a tighter confidence interval if I include more variables in my model?

Yes, you do (EDIT: ...basically. Subject to some caveats. See comments below). Here's why: adding more variables reduces the SSE and thereby the variance of the model, on which your confidence and prediction intervals depend. This even happens (to a lesser extent) when the variables you are adding are completely independent of the response:

a=rnorm(100)
b=rnorm(100)
c=rnorm(100)
d=rnorm(100)
e=rnorm(100)

summary(lm(a~b))$sigma       # 0.9634881
summary(lm(a~b+c))$sigma     # 0.961776
summary(lm(a~b+c+d))$sigma   # 0.9640104 (Went up a smidgen)
summary(lm(a~b+c+d+e))$sigma # 0.9588491 (and down we go again...)

But this does not mean you have a better model. In fact, this is how overfitting happens.

Consider the following example: let's say we draw a sample from a quadratic function with noise added.

enter image description here

A first order model will fit this poorly and have very high bias.

enter image description here

A second order model fits well, which is not surprising since this is how the data was generated in the first place.

enter image description here

But let's say we don't know that's how the data was generated, so we fit increasingly higher order models.

enter image description here

As the complexity of the model increases, we're able to capture more of the fluctuations in the data, effectively fitting our model to the noise, to patterns that aren't really there. With enough complexity, we can build a model that will go through each point in our data nearly exactly.

enter image description here

As you can see, as the order of the model increases, so does the fit. We can see this quantitatively by plotting the training error:

enter image description here

But if we draw more points from our generating function, we will observe the test error diverges rapidly.

enter image description here

The moral of the story is to be wary of overfitting your model. Don't just rely on metrics like adjusted-R2, consider validating your model against held out data (a "test" set) or evaluating your model using techniques like cross validation.

For posterity, here's the code for this tutorial:

set.seed(123)
xv = seq(-5,15,length.out=1e4)
X=sample(xv,20)
gen=function(v){v^2 + 7*rnorm(length(v))}
Y=gen(X)
df = data.frame(x=X,y=Y)
plot(X,Y)
lines(xv,xv^2, col="blue") # true model
legend("topleft", "True Model", lty=1, col="blue")    

build_formula = function(N){ 
  paste('y~poly(x,',N,',raw=T)')
}

deg=c(1,2,10,20)
formulas = sapply(deg[2:4], build_formula)
formulas = c('y~x', formulas)
pred = lapply(formulas
              ,function(f){
                predict(
                  lm(f, data=df)
                  ,newdata=list(x=xv))
                            })
# Progressively add fit lines to the plot
lapply(1:length(pred), function(i){
  plot(df, main=paste(deg[i],"-Degree"))
  lapply(1:i,function(n){
    lines(xv,pred[[n]], col=n)
  })
  })

# let's actually generate models from poly 1:20 to calculate MSE
deg=seq(1,20)
formulas = sapply(deg, build_formula)
pred.train = lapply(formulas
                   ,function(f){
                     predict(
                       lm(f, data=df)
                       ,newdata=list(x=df$x))
                   })

pred.test = lapply(formulas
              ,function(f){
                predict(
                  lm(f, data=df)
                  ,newdata=list(x=xv))
              })

rmse.train = unlist(lapply(pred.train,function(P){
  regr.eval(df$y,P, stats="rmse")
}))    

yv=gen(xv)
rmse.test = unlist(lapply(pred.test,function(P){
  regr.eval(yv,P, stats="rmse")
}))    

plot(rmse.train, type='l', col='blue'
     , main="Training Error"
     ,xlab="Model Complexity")

plot(rmse.test, type='l', col='red'
     , main="Train vs. Test Error"
     ,xlab="Model Complexity")
lines(rmse.train, type='l', col='blue')
legend("topleft", c("Test","Train"), lty=c(1,1), col=c("red","blue"))

Solved – Confidence interval for the standard deviation of a Normal distribution with known mean

After looking around for a while without finding anything satisfactory, this is the best answer that seems to make sense to me:

Notice that the sampling distribution of the unknown $\sigma$ is not Normal, so $Q = \frac{\bar{Y} - \mu_{0}}{\sigma / \sqrt{n}}$ does not actually follow $N(0, 1)$, thus it cannot be a pivotal quantity, at least not a pivotal quantity with a Normal distribution.

The case for unknown $\mu$ however, is different because the sampling distribution of $\mu$ is Normal by Central Limit Theorem, so we know $Q = \frac{\bar{Y} - \mu}{\sigma_{0} / \sqrt{n}}$ follows $N(0, 1)$, which makes it a pivotal quantity.

This is why to find the confidence interval for $\sigma$, we have to use the pivotal quantity $$\frac{(n-1)S^2}{\sigma^2},$$ which follows a $\chi^2$ distribution with $n-1$ degrees of freedom.

Best Answer

Related Solutions

Solved – Does more variables mean tighter confidence intervals

Solved – Confidence interval for the standard deviation of a Normal distribution with known mean

Related Question