Solved – Confidence interval for polynomial linear regression

confidence-interval, regression

I have a model which is not linear but rather polynomial, and I have to estimate the parameters by giving a 95% confidence interval.
There are plenty of formulas for regression of the type $Y = \beta_0 + \beta_1 X$, but do they apply in my case (where $Y = \beta_1 X + \beta_2 X^2$)?

Of course, R gives me a pretty output:

Call:
lm(formula = dN ~ 0 + I(N) + I(N^2))

Residuals:
   1        2        3        4        5        6        7 
 0.02456 -0.10512 -0.12136  0.01848  0.24056 -0.11465  0.02646 

Coefficients:
         Estimate Std. Error t value Pr(>|t|)    
I(N)    2.977e-02  6.596e-04   45.14 1.01e-07 ***
I(N^2) -4.440e-05  1.770e-06  -25.08 1.88e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1403 on 5 degrees of freedom
Multiple R-squared:  0.9992,    Adjusted R-squared:  0.9989 
F-statistic:  3173 on 2 and 5 DF,  p-value: 1.739e-08

I have read in a PDF file (page 13) that one can get the confidence interval simply from the standard error reported by R: $\hat{\beta}_1 \pm t_{\alpha/2} \times \text{(Std. Error)}$. Does this always hold?

In the same way, are the confidence intervals for the model prediction the same?

Thank you in advance for any clarification.

Best Answer

Polynomial regression is in effect multiple linear regression: consider $X_1=X$ and $X_2=X^2$ -- then $E(Y) = \beta_1 X + \beta_2 X^2$ is the same as $E(Y) = \beta_1 X_1 + \beta_2 X_2$.

As such, methods for constructing confidence intervals for the parameters (and for the conditional mean) in multiple regression carry over directly to the polynomial case, and most regression packages will compute them for you. Yes, the interval can be computed with the formula you suggest, provided the assumptions needed for the $t$-interval hold and you use the right degrees of freedom for the $t$: the residual d.f., which R reports in the summary output.
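For concreteness, here is a small sketch of that computation in R. The vectors N and dN below are made-up stand-ins (your actual data aren't shown in the question); only the last few lines matter.

## Illustrative data only -- substitute your own N and dN
set.seed(1)
N  <- c(50, 100, 150, 200, 300, 400, 500)
dN <- 0.03 * N - 4.4e-5 * N^2 + rnorm(7, sd = 0.15)

fit <- lm(dN ~ 0 + N + I(N^2))
cf  <- summary(fit)$coefficients   # estimates, std. errors, t values, p values
df  <- df.residual(fit)            # residual degrees of freedom (n - p = 5 here)

## 95% CI for each coefficient: estimate +/- t_{0.025, df} * std. error
## (columns are lower and upper limits)
cf[, "Estimate"] + outer(cf[, "Std. Error"], qt(c(0.025, 0.975), df))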

The R function confint can be used to construct confidence intervals for parameters from a regression model. See ?confint.
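Continuing the illustrative fit above, confint reproduces the hand computation directly:

## Same parameter intervals, computed by confint()
confint(fit, level = 0.95)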

In the case of a confidence interval for the conditional mean, let $X$ be the matrix of predictors, whether for polynomial regression or any other multiple regression model; let the estimated variance of the mean at $x_i=(x_{1i},x_{2i},\ldots,x_{pi})$ (the $i$-th row of $X$) be $v_i=\hat{\sigma}^2 x_i(X'X)^{-1}x_i'$, and let $s_i=\sqrt{v_i}$ be the corresponding standard error. Let $t$ be the upper $\alpha/2$ critical value of the $t$ distribution on the residual degrees of freedom ($n-p-1$ with an intercept; $n-p$ for a no-intercept model like yours). Then the pointwise confidence interval for the mean at $x_i$ is $\hat{y}_i \pm t\cdot s_i$.
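If you want to see that formula in action, here is a sketch that computes $v_i$, $s_i$ and the pointwise interval by hand for the illustrative fit above (again, the data are made up):

## Hand computation of the pointwise CI for the conditional mean
X      <- model.matrix(fit)                  # predictor matrix (columns N and N^2)
sigma2 <- summary(fit)$sigma^2               # estimated residual variance
v      <- sigma2 * rowSums((X %*% solve(crossprod(X))) * X)  # v_i = sigma^2 x_i (X'X)^{-1} x_i'
s      <- sqrt(v)                            # standard error of the estimated mean at each x_i
tcrit  <- qt(0.975, df.residual(fit))
cbind(fit = fitted(fit),
      lwr = fitted(fit) - tcrit * s,
      upr = fitted(fit) + tcrit * s)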

Also, the R function predict can be used to construct CIs for $E(Y\mid X)$; see ?predict.lm.
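For the same illustrative fit, this reproduces the hand computation above:

## Pointwise CIs for E(Y|X), here at the observed values of N
predict(fit, interval = "confidence", level = 0.95)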

[At least when doing polynomial regression with an intercept, it generally makes sense to use orthogonal polynomials, though if the spread of $X$ is large compared to its mean and the degree is low (such as quadratic), it isn't so critical. I tend to use them anyway, because the separate contributions of the linear and quadratic terms are then easier to interpret.]
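As a quick check with the made-up data above (note that both fits below include an intercept, unlike the model in your question), raw and orthogonal polynomials give identical fitted values:

## Same column space, so the fitted curves agree
fit_raw  <- lm(dN ~ N + I(N^2))   # raw polynomial terms
fit_orth <- lm(dN ~ poly(N, 2))   # orthogonal polynomial terms
all.equal(fitted(fit_raw), fitted(fit_orth))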
