Regression Data – Calculating the Mean Effectively

mean, regression, self-study

I have some data to analyze where $y$ depends on $x$; a linear regression was used.

It's a question from an exam, so I think it should be solvable. The regression was used to estimate the mean miles per gallon (response) from the number of miles driven (predictor).

I have the following statistics available:

  • Correlation coefficient (0.117)
  • Standard deviation (0.482)
  • Number of observations (101)

An ANOVA of this regression yields (Regression and residuals, respectively):

  • df: 1, 99
  • SS: 0.319, 22.96
  • MS: 0.319, 0.232
  • F-value: 1.374, significance F (p-value): 0.244

The regression itself (Intercept and Slope, respectively):

  • Coefficients: 6.51, -0.00024
  • Standard errors: 0.186, 0.0002
  • t-Values: 34.90, -1.17
  • p-Values: 1.93E-57, 0.2439

Also, the "upper and lower 95% and 99%" are given for the above regression (although I'm not sure what that means).

Now, I am asked to calculate the mean $y$ for several values of $x$. That's relatively easy: I just use the coefficients. For example, I can calculate the mean miles per gallon for 500 miles driven.

The part where I'm stuck: I need to calculate the 99% confidence interval for the mean of $y$. Obviously, this is what the example is all about; the introduction states that the mileage of a car should be estimated.
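As a minimal sketch of that step, the fitted line can be evaluated directly from the reported coefficients. The function name `predicted_mean_mpg` is made up for illustration; the coefficients are the rounded values quoted above.

```python
def predicted_mean_mpg(miles, intercept=6.51, slope=-0.00024):
    """Fitted mean miles per gallon at a given number of miles driven."""
    return intercept + slope * miles


# Fitted mean at 500 miles driven: 6.51 - 0.00024 * 500, i.e. about 6.39
print(round(predicted_mean_mpg(500), 4))
```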

My question: How can I find out the mean of $y$ using the data provided above? (And, subsequently, the 99% confidence interval, although I seem to have the standard deviation, so that shouldn't be the problem)

Best Answer

Contrary to @whuber's claim, the means of $x$ and $y$ can be recovered from the information given.

Okay, so you have the line equation

$$y_i=\alpha +x_i\beta + e_i$$

with estimates $\hat{\beta}=r\frac{s_y}{s_x}$ and $\hat{\alpha}=\overline{y}-\hat{\beta}\overline{x}$,

where $r$ is the correlation. The question doesn't state whether the standard deviation (0.482) is $s_y$ or $s_x$ (the MLE standard deviations, with divisor $n$). Either way, you can work out the other one from the information given, because their ratio must satisfy:

$$\frac{\hat{\beta}}{r}=\frac{s_y}{s_x}$$

The slope can't be negative if the correlation is positive, so I assume something was reported incorrectly (a correlation of 0.117 together with a slope of -0.00024 is impossible). This affects the numbers, but not the general method. So I will treat both standard deviations as known, without writing in the specific values. The same goes for the rest of the actual numbers.
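One way to probe the sign mismatch is to recompute the magnitude of the correlation from the ANOVA sums of squares, since in simple linear regression $r^2$ equals the regression sum of squares divided by the total. The sketch below assumes that the quoted 0.117 might be an unsigned value (some regression summaries report only the magnitude). That is a guess, not something stated in the question, and the variable names are illustrative.

```python
import math

# Values quoted in the question's ANOVA table and coefficient table
ss_regression = 0.319
ss_residual = 22.96
slope = -0.00024

# In simple linear regression, r^2 = SS_regression / SS_total
r_magnitude = math.sqrt(ss_regression / (ss_regression + ss_residual))

# The correlation must carry the same sign as the slope
r_signed = math.copysign(r_magnitude, slope)

print(round(r_magnitude, 3), round(r_signed, 3))
```

If the magnitude matches the reported 0.117, the figures are consistent with a correlation of about -0.117 whose sign was dropped somewhere along the way.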

Now the variance of $\hat{\beta}$ is given by:

$$var(\hat{\beta})=s_e^2(X^TX)^{-1}_{22}=\frac{s_e^2 (X^TX)_{11}}{|X^TX|}$$

Note that $(X^TX)_{11}=n$ and $s_e^2$ is the "mean square error". The variance of $\alpha$ is given by:

$$var(\hat{\alpha})=s_e^2(X^TX)^{-1}_{11}=\frac{s_e^2 (X^TX)_{22}}{|X^TX|}$$
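For reference, here are the matrices written out, assuming (as the indices above imply) that the first column of $X$ is all ones and the second column holds the $x_i$:

$$X^TX=\begin{pmatrix} n & \sum_i x_i\\ \sum_i x_i & \sum_i x_i^2\end{pmatrix},\qquad (X^TX)^{-1}=\frac{1}{|X^TX|}\begin{pmatrix}\sum_i x_i^2 & -\sum_i x_i\\ -\sum_i x_i & n\end{pmatrix},\qquad |X^TX|=n\sum_i x_i^2-\Big(\sum_i x_i\Big)^2$$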

Now $(X^TX)_{22}=\sum_i x_i^2 = n(s_x^2+\overline{x}^2)$, since $s_x^2=\frac{1}{n}\sum_i x_i^2-\overline{x}^2$.

Dividing these two variances gives:

$$\frac{var(\hat{\alpha})}{var(\hat{\beta})}=\frac{(X^TX)_{22}}{(X^TX)_{11}}=\frac{n(s_x^2+\overline{x}^2)}{n}=s_x^2+\overline{x}^2$$

All quantities in this equation are known except the mean $\overline{x}$, so we can rearrange it and solve for the mean:

$$\overline{x}=\pm\sqrt{\frac{var(\hat{\alpha})}{var(\hat{\beta})}-s_x^2}$$

But we know from the start that $x_i>0$: you can't drive "negative miles". So only the positive square root is to be taken. The rest is straightforward CI work. The estimate of the mean $\hat{\overline{y}}$ is given by:

$$\hat{\overline{y}}=\hat{\alpha}+\hat{\beta}\overline{x}=\hat{\alpha}+\hat{\beta}\sqrt{\frac{var(\hat{\alpha})}{var(\hat{\beta})}-s_x^2}=\overline{y}$$

And the variance is given by:

$$var(\hat{\overline{y}})=var(\hat{\alpha})+\overline{x}^2 var(\hat{\beta})+2\overline{x}cov(\hat{\alpha},\hat{\beta})$$

Now, using $|X^TX|=n\sum_i x_i^2-(n\overline{x})^2=n^2s_x^2$, the covariance is equal to: $$cov(\hat{\alpha},\hat{\beta})=s_e^2(X^TX)^{-1}_{21}=-\frac{s_e^2 (X^TX)_{21}}{|X^TX|}=-\frac{s_e^2 n\overline{x}}{n^2s_x^2}=-\frac{s_e^2 \overline{x}}{ns_x^2}$$

And so the variance is given by:

$$var(\hat{\overline{y}})=var(\hat{\alpha})+\overline{x}^2 var(\hat{\beta})-2\frac{s_e^2 \overline{x}^2}{ns_x^2}=var(\hat{\alpha})+\left(\frac{var(\hat{\alpha})}{var(\hat{\beta})}-s_x^2\right)\left(var(\hat{\beta})-2\frac{s_e^2}{ns_x^2}\right)$$

Since $var(\hat{\beta})=\frac{s_e^2}{ns_x^2}$, this simplifies to $var(\hat{\alpha})-\overline{x}^2 var(\hat{\beta})=\frac{s_e^2}{n}$.
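To make the recovery of $\overline{x}$ concrete, here is a rough numerical sketch using the rounded figures from the question. It relies on two identities for the divisor-$n$ standard deviation: $var(\hat{\alpha})/var(\hat{\beta})=\frac{1}{n}\sum_i x_i^2=s_x^2+\overline{x}^2$ and $var(\hat{\beta})=s_e^2/(ns_x^2)$. Treating the residual mean square (0.232) as $s_e^2$ and the reported coefficient standard errors as the square roots of the variances are assumptions. The slope's standard error is quoted to only one significant digit, so the results are approximate at best.

```python
import math

# Rounded values quoted in the question
n = 101
alpha_hat, beta_hat = 6.51, -0.00024
se_alpha, se_beta = 0.186, 0.0002
ms_residual = 0.232  # assumed to play the role of s_e^2

# var(alpha_hat) / var(beta_hat) = sum(x_i^2) / n = s_x^2 + x_bar^2
mean_square_x = (se_alpha / se_beta) ** 2

# var(beta_hat) = s_e^2 / (n * s_x^2)  =>  s_x^2 = s_e^2 / (n * var(beta_hat))
s_x_squared = ms_residual / (n * se_beta ** 2)

# Positive root only, because miles driven cannot be negative
x_bar = math.sqrt(mean_square_x - s_x_squared)

# The fitted line passes through (x_bar, y_bar)
y_bar = alpha_hat + beta_hat * x_bar

print(round(x_bar, 1), round(y_bar, 3))
```

With these inputs the mean distance comes out at roughly 900 miles and the mean response at roughly 6.29 miles per gallon.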

So you construct your $100(1-P)$% confidence interval by choosing $T_{1-P/2}^{(n-2)}$ as the $1-P/2$ quantile of the standard t distribution with $n-2$ degrees of freedom (which is effectively equal to the standard normal, as $n-2=99$), and you have:

$$CI=\overline{y}\pm T_{1-P/2}^{(n-2)}\sqrt{var(\hat{\overline{y}})}$$

And all quantities are calculable, given the information.
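As a hedged end-to-end sketch, the interval can be put together from the rounded figures in the question. It uses the standard result that the variance of the fitted mean at $x=\overline{x}$ reduces to $s_e^2/n$, takes the residual mean square (0.232) as $s_e^2$, and replaces the t quantile with the standard normal quantile, as suggested above. The exact t quantile with 99 degrees of freedom is slightly larger, so this interval is marginally too narrow. The function name is illustrative.

```python
import math
from statistics import NormalDist


def mean_response_ci(alpha_hat, beta_hat, se_alpha, se_beta, ms_residual, n, level=0.99):
    """Approximate CI for the mean of y, using a normal quantile in place of t."""
    # Recover x_bar from var(alpha)/var(beta) = s_x^2 + x_bar^2
    s_x_squared = ms_residual / (n * se_beta ** 2)
    x_bar = math.sqrt((se_alpha / se_beta) ** 2 - s_x_squared)
    y_bar = alpha_hat + beta_hat * x_bar

    # At x = x_bar the variance of the fitted mean is s_e^2 / n
    std_error = math.sqrt(ms_residual / n)
    quantile = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = quantile * std_error
    return y_bar - half_width, y_bar, y_bar + half_width


lower, y_bar, upper = mean_response_ci(6.51, -0.00024, 0.186, 0.0002, 0.232, 101)
print(round(lower, 3), round(y_bar, 3), round(upper, 3))
```

Because the inputs are heavily rounded, the output should be read as an order-of-magnitude check on the method rather than as the exam's expected answer.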