I have fitted a linear regression and plotted the data, the regression line and the 95% confidence interval. However, most of the data points seem to fall outside the confidence interval. How am I supposed to interpret the confidence interval? It cannot mean "I am 95% confident that a data point will be this close to the regression line", since far more than 5% of the data points fall outside that area. So what does it mean?
Regression Analysis – Confidence Intervals for Regression Interpretation
confidence interval, regression
Related Solutions
I understand some of your questions but others are not clear. Let me answer and state some facts and maybe that will clear up all of your confusion.
The fit you have is remarkably good, so the confidence intervals should be very tight. There are two types of confidence regions that can be considered. The first is the simultaneous region, which is intended to cover the entire true regression function with the given confidence level.
The second type, which is what you are looking at, consists of the confidence intervals for the fitted regression points. Each one is only intended to cover the mean value of y at the given value(s) of the covariate(s). They are not intended to cover y values at other values of the covariates. In fact, if the intervals are very tight, as they should be in your case, they will not cover many, if any, of the data points as you get away from the fixed value(s) of the covariate(s). For that type of coverage you need the simultaneous confidence curves (upper and lower bound curves).
Now it is true that if you predict a new y at a given value of a covariate, and you want the same confidence level for the prediction interval as you used for the confidence interval for the mean of y at that value, the prediction interval will be wider. The reason is that the model tells you there will be added variability: a new y has its own independent error, which must be accounted for in the interval. That error component does not enter into the estimates based on the data used in the fit.
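To make the difference concrete, here is a minimal sketch in Python (NumPy and SciPy assumed) that computes both the pointwise confidence band and the prediction band for a simple linear regression and counts how many data points fall inside each. The function name `regression_bands` and the simulated data are illustrative assumptions, not taken from the question.

```python
import numpy as np
from scipy import stats

def regression_bands(x, y, x0, level=0.95):
    """Fitted values plus confidence and prediction half-widths at x0.

    Hypothetical helper for simple (one-covariate) least squares with
    iid normal errors.
    """
    x, y, x0 = (np.asarray(a, dtype=float) for a in (x, y, x0))
    n = x.size
    xbar = x.mean()
    sxx = np.sum((x - xbar) ** 2)
    slope = np.sum((x - xbar) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * xbar
    resid = y - (intercept + slope * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))      # residual standard error
    tcrit = stats.t.ppf(0.5 + level / 2, df=n - 2)
    fit = intercept + slope * x0
    leverage = 1.0 / n + (x0 - xbar) ** 2 / sxx
    ci_half = tcrit * s * np.sqrt(leverage)        # uncertainty of the mean of y
    pi_half = tcrit * s * np.sqrt(1.0 + leverage)  # adds the new y's own error
    return fit, ci_half, pi_half

# Simulated data (an assumption for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, size=200)

fit, ci_half, pi_half = regression_bands(x, y, x)
inside_ci = np.mean(np.abs(y - fit) <= ci_half)
inside_pi = np.mean(np.abs(y - fit) <= pi_half)
print(f"share of points inside confidence band: {inside_ci:.2f}")
print(f"share of points inside prediction band: {inside_pi:.2f}")
```

The only difference between the two half-widths is the extra `1.0 +` under the square root, which is the variance contribution of the new observation's own error. With a sample this large, the confidence band should contain only a small share of the points, while the prediction band should contain roughly 95% of them.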
You don't compare the individual points to conclude a treatment effect. You see whether the lines for the treatment and control are different.
In some circumstances, the fitted lines might be parallel, and just the difference in intercept is of interest. In others, both the intercept and slope might differ, and any difference would be of interest.
Testing point vs line in ordinary regression (not errors-in-variables, which is more complicated):
It's not correct to check whether data values from another sample lie in the confidence interval, because the data values themselves have noise.
Call the first sample $(\underline{x}_1,\underline{y}_1)$, and the second one $(\underline{x}_2,\underline{y}_2)$. Your model for the first sample is $y_{1,i} = \alpha_1 + \beta_1 x_{1,i} + \varepsilon_i$, with the usual iid $N(0,\sigma^2)$ assumption on the errors.
You want to see if a particular point $(x_{2,j},y_{2,j})$ is consistent with the first sample. Equivalently, you want to check whether an interval for $y_{2,j} - \left(\alpha_1 + \beta_1 x_{2,j}\right)$ includes 0 (notice the points are second-sample, the line is first-sample).
The usual way to obtain such a CI would be to construct a pivotal quantity, though one could simulate or bootstrap as well.
However, since in this illustration we're doing it for a single point, under normal assumptions and with ordinary regression conditions, we can save some effort: this is a solved problem. It corresponds to (assuming sample 1 and sample 2 have a common population variance) checking whether one of the sample 2 observations lies within a prediction interval based on sample 1, rather than a confidence interval.
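As a rough sketch of that check in Python (NumPy and SciPy assumed): fit the line on sample 1, then see whether a single sample-2 point falls inside the prediction interval at its x value. The function name `in_prediction_interval` and the numbers are made up for illustration, and the common-variance and normality assumptions above are taken for granted.

```python
import numpy as np
from scipy import stats

def in_prediction_interval(x1, y1, x_new, y_new, level=0.95):
    """Check one new point against a prediction interval from sample 1.

    Hypothetical helper: returns (inside, t_stat), where t_stat is the
    standardized distance of the new point from the sample-1 line.
    """
    x1 = np.asarray(x1, dtype=float)
    y1 = np.asarray(y1, dtype=float)
    n = x1.size
    slope, intercept = np.polyfit(x1, y1, 1)   # least-squares line on sample 1
    resid = y1 - (intercept + slope * x1)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))  # residual standard error
    xbar = x1.mean()
    sxx = np.sum((x1 - xbar) ** 2)
    # Standard error of (new y minus fitted line), including the new y's own noise
    se_pred = s * np.sqrt(1.0 + 1.0 / n + (x_new - xbar) ** 2 / sxx)
    t_stat = (y_new - (intercept + slope * x_new)) / se_pred
    tcrit = stats.t.ppf(0.5 + level / 2, df=n - 2)
    return bool(abs(t_stat) <= tcrit), float(t_stat)

# Made-up sample 1: roughly y = 2 + 0.5 x with small fixed deviations
x1 = np.arange(10.0)
noise = np.array([0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.15, 0.0, 0.1, -0.05])
y1 = 2.0 + 0.5 * x1 + noise

near_ok, near_t = in_prediction_interval(x1, y1, x_new=4.0, y_new=4.0)
far_ok, far_t = in_prediction_interval(x1, y1, x_new=4.0, y_new=20.0)
print(near_ok, far_ok)
```

The standardized difference `t_stat` is the pivotal quantity: under the stated assumptions it follows a t distribution with n - 2 degrees of freedom, so comparing it with the t critical value is the same as asking whether the interval for the difference includes 0.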
Best Answer
There are two 95% intervals you can derive from your data. One is the 95% CI of the regression line, which is the red one in the attached illustration. The code you provided plots this 95% CI. Because it is for the line, not for the data points, its precision improves as you get more data, and the band narrows. The code you cited is a somewhat special case: the sample size is only 7, so the 95% CI of the line happens to include about 90% of the data points, but that is just a coincidence.
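A quick way to see the narrowing is to compute the half-widths directly. This sketch (Python with SciPy assumed) evaluates both intervals at the mean of x and pretends the residual standard deviation is known to be 1, which is an assumption made only to keep the arithmetic visible; the function name `half_widths_at_mean` is hypothetical.

```python
import math
from scipy import stats

def half_widths_at_mean(n, s=1.0, level=0.95):
    """Half-widths of the 95% intervals at x = mean(x), for sample size n.

    s is the residual standard deviation, fixed at 1 here as an assumption.
    """
    tcrit = stats.t.ppf(0.5 + level / 2, df=n - 2)
    ci_half = tcrit * s * math.sqrt(1.0 / n)        # interval for the line
    pi_half = tcrit * s * math.sqrt(1.0 + 1.0 / n)  # interval for a new data point
    return ci_half, pi_half

ci_7, pi_7 = half_widths_at_mean(7)
ci_700, pi_700 = half_widths_at_mean(700)
print(f"n=7:   line +/- {ci_7:.3f}, new point +/- {pi_7:.3f}")
print(f"n=700: line +/- {ci_700:.3f}, new point +/- {pi_700:.3f}")
```

The interval for the line keeps shrinking as n grows, while the interval for a new point levels off near 1.96 times the residual standard deviation. That is why, with only 7 points, the two can look similar enough for the line's CI to cover most of the data.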
The interval that includes approximately 95% of the data points is shown below in green. I am not sure what it is called, but from what I have gathered on this site, it should not be called a confidence interval. I think you are looking for this kind of line, but have been using the wrong code.
The one that matters more often is the red one. And for proper interpretation, other users have provided links to some useful posts.