Confidence Intervals – Calculating Confidence Intervals Around Functions of Estimated Parameters

confidence interval, fitting, normal distribution, standard deviation, standard error

Suppose I estimate the following model,
$$
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2\ ,
$$

and am additionally interested in the quantity
$$
\gamma = \sqrt{\beta_0^2 - 4\beta_1\beta_2}\ .
$$

An estimate $\hat{\gamma}$ of this quantity is clearly obtained by plugging the estimated $\hat\beta_i$'s into the formula above.

But how does one go about computing the confidence interval around such a "compound" quantity that is not directly estimated?

My attempt.

My "empirical" approach is as follows. I draw a sample of size $n$ from a multivariate normal distribution whose mean vector is the estimated $\hat\beta_i$'s and whose covariance matrix is that of the fit. This gives me a distribution of size $n$ for $\gamma$. The mean of this distribution, $\bar\gamma$, is naturally close to the $\hat\gamma$ computed above, and I additionally obtain a standard deviation $\sigma$ of this distribution.
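To make the simulation step concrete, here is a minimal numpy sketch of the approach. The coefficient values and the (diagonal) covariance matrix are made up for illustration; in practice they would come from the fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical estimates: beta_hat and the covariance matrix of the fit
beta_hat = np.array([2.0, 0.5, -1.0])        # (b0, b1, b2), made up
cov = np.diag([0.04, 0.01, 0.01])            # made-up covariance of beta_hat

n = 100_000
draws = rng.multivariate_normal(beta_hat, cov, size=n)   # n x 3 sample
b0, b1, b2 = draws[:, 0], draws[:, 1], draws[:, 2]

gamma = np.sqrt(b0**2 - 4 * b1 * b2)         # distribution of size n for gamma
print(gamma.mean(), gamma.std())
```

With these made-up numbers, `gamma.mean()` is close to the plug-in value $\sqrt{2^2 - 4 \cdot 0.5 \cdot (-1)} = \sqrt{6} \approx 2.449$.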

Using $\sigma$ to generate a confidence interval.

If I use the standard definition $\bar\gamma \pm t(\alpha, n-1)\,\sigma/\sqrt{n}$, where $t$ is the PPF of the $t$-distribution, I run into the following quandary: since I pick $n$ myself, I can make the standard error as small as I like. This is not what I want; I am looking for a "natural" confidence interval that depends solely on the uncertainty in my estimates of the $\beta$'s. My idea is therefore this: $\sigma$ stays positive and appears to converge to a fixed value for large $n$, so roughly 95% of the values in my $\gamma$ distribution lie within the interval $\bar\gamma \pm 2\sigma$.
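The quandary is easy to verify numerically. In the following self-contained numpy sketch (again with made-up estimates), the standard error $\sigma/\sqrt{n}$ shrinks toward zero as $n$ grows, while $\sigma$ itself settles at a fixed value that reflects only the parameter uncertainty:

```python
import numpy as np

rng = np.random.default_rng(1)
beta_hat = np.array([2.0, 0.5, -1.0])        # made-up estimates
cov = np.diag([0.04, 0.01, 0.01])            # made-up covariance

sds, ses = [], []
for n in (100, 10_000, 1_000_000):
    d = rng.multivariate_normal(beta_hat, cov, size=n)
    gamma = np.sqrt(d[:, 0]**2 - 4 * d[:, 1] * d[:, 2])
    sds.append(gamma.std())                  # stabilizes as n grows
    ses.append(gamma.std() / np.sqrt(n))     # shrinks toward zero
    print(n, sds[-1], ses[-1])
```

With the simulated draws in hand, `np.percentile(gamma, [2.5, 97.5])` gives an interval directly from the distribution, without assuming $\gamma$ is normal.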

Is this approach valid?

Best Answer

Two common approaches for this problem are to calculate the non-linear combination of the coefficients directly from the regression or to bootstrap it.

The variance in the former is based on the "delta method", an approximation appropriate in large samples. This was suggested in the other answer, but statistics software can make the calculation a whole lot easier.
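For readers without such software, the delta-method calculation can also be done by hand: compute the gradient of $g(\beta) = \sqrt{\beta_0^2 - 4\beta_1\beta_2}$ at $\hat\beta$ and form $\nabla g^\top V \nabla g$. A minimal numpy sketch, with a made-up $\hat\beta$ and diagonal covariance matrix standing in for the regression output:

```python
import numpy as np

beta_hat = np.array([2.0, 0.5, -1.0])        # made-up estimates (b0, b1, b2)
V = np.diag([0.04, 0.01, 0.01])              # made-up covariance of beta_hat

def g(b):
    return np.sqrt(b[0]**2 - 4 * b[1] * b[2])

gamma_hat = g(beta_hat)

# Analytic gradient of g at beta_hat:
# dg/db0 = b0/g, dg/db1 = -2*b2/g, dg/db2 = -2*b1/g
grad = np.array([beta_hat[0], -2 * beta_hat[2], -2 * beta_hat[1]]) / gamma_hat

se = np.sqrt(grad @ V @ grad)                # delta-method standard error
lo, hi = gamma_hat - 1.96 * se, gamma_hat + 1.96 * se
print(gamma_hat, se, (lo, hi))
```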

The variance for the latter comes from resampling the data in memory with replacement, fitting the model, calculating the coefficient combination, and then using the sampled distribution to get the confidence interval.
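The resampling loop is equally simple to write out by hand. Here is a numpy-only sketch using simulated data as a stand-in for a real data set (the coefficients and noise level are made up), with a percentile bootstrap interval at the end:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data set (stand-in for real data)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
beta_true = np.array([2.0, 0.5, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

def gamma_of(b):
    return np.sqrt(b[0]**2 - 4 * b[1] * b[2])

reps = 1000
stats = []
for _ in range(reps):
    idx = rng.integers(0, N, size=N)         # resample rows with replacement
    b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)  # refit the model
    val = gamma_of(b)
    if np.isfinite(val):                     # skip replicates with a negative radicand
        stats.append(val)

lo, hi = np.percentile(stats, [2.5, 97.5])   # percentile bootstrap 95% CI
print(lo, hi)
```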

Here's an example of both using Stata. (Note that the code below computes $\sqrt{\beta_0 - 4\beta_1\beta_2}$, without squaring the intercept; the method is identical for the squared version.)

. sysuse auto, clear
(1978 automobile data)

. set seed 11082021

. regress price mpg foreign

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     14.07
       Model |   180261702         2  90130850.8   Prob > F        =    0.0000
    Residual |   454803695        71  6405685.84   R-squared       =    0.2838
-------------+----------------------------------   Adj R-squared   =    0.2637
       Total |   635065396        73  8699525.97   Root MSE        =    2530.9

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         mpg |  -294.1955   55.69172    -5.28   0.000    -405.2417   -183.1494
     foreign |   1767.292    700.158     2.52   0.014     371.2169    3163.368
       _cons |   11905.42   1158.634    10.28   0.000     9595.164    14215.67
------------------------------------------------------------------------------

. nlcom (gamma_dm:sqrt(_b[_cons] - 4*_b[mpg]*_b[foreign]))

    gamma_dm: sqrt(_b[_cons] - 4*_b[mpg]*_b[foreign])

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    gamma_dm |   1446.245   361.0078     4.01   0.000     738.6823    2153.807
------------------------------------------------------------------------------

The 95% CI using the delta method is [738.6823, 2153.807].

Bootstrapping yields [740.5149, 2151.974], which is fairly similar:

. bootstrap (gamma_bs:sqrt(_b[_cons] - 4*_b[mpg]*_b[foreign])), reps(500) nodots: regress price mpg foreign

Linear regression                                          Number of obs =  74
                                                           Replications  = 499

        Command: regress price mpg foreign
[gamma_bs]_bs_1: sqrt(_b[_cons] - 4*_b[mpg]*_b[foreign])

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
gamma_bs     |
       _bs_1 |   1446.245   360.0728     4.02   0.000     740.5149    2151.974
------------------------------------------------------------------------------
Note: One or more parameters could not be estimated in 1 bootstrap replicate;
      standard-error estimates include only complete replications.

Your Solution

Your proposed solution would work if you have lots of data, but here it does not do so well with only 74 observations:

. quietly regress price mpg foreign

. corr2data b_mpg b_foreign b_cons, n(500) means(e(b)) cov(e(V)) clear
(obs 500)

. gen gamma_sim = sqrt(b_cons - 4*b_mpg*b_foreign)
(3 missing values generated)

. sum gamma_sim

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
   gamma_sim |        497    1426.183    366.6408   197.5594   2397.263

. display "[" %-9.4f r(mean) + invttail(r(N)-1,.975)*r(sd) ", " r(mean) + invttail(r(N)-1,.025)*r(sd) "]"
[705.8224 , 2146.5434]

The CI here is [705.8224 , 2146.5434], which is noticeably different from the two CIs above.

My thought is that if you are going to simulate, you might as well bootstrap rather than rely on a normal approximation that is only valid asymptotically. If you have lots of data, the difference between bootstrapping and sampling from a multivariate normal parameterized by the estimated coefficients and covariance matrix should not be noticeable.
