Solved – Proof of the link between the OLS slope estimate and the two-sample t-test statistic (categorical X variable)

least squares, mathematical-statistics, regression

This question concerns a univariate OLS regression with a single binary categorical predictor (coded 0, 1).

I am wrestling with the proof that the two-sample t statistic can be written as
$$t =\frac{b_1}{s(b_1)} $$

starting from the basic OLS estimator for the slope which is
$$b_1= \frac{\sum (X_i - \bar{X})Y_i}{\sum(X_i - \bar{X})^2}. $$

I know that the first step is to show that the denominator $\sum(X_i - \bar{X})^2$ is equal to

$$\frac{n_1n_0}{n} $$

where $n_0$ and $n_1$ are the sizes of groups A and B, in which $X_i = 0$ and $X_i = 1$ respectively, and $n = n_0 + n_1$.

I just can't get there, and know my partial summation algebra is lacking.

I am comfortable that $\sum(X_i - \bar{X})^2 = \sum(X_A - \bar{X})^2 + \sum(X_B - \bar{X})^2$ and then expanding each of these to the form $\sum X_i^2 - n\bar{X}^2$, but I can't get further than this:

$$\sum_{i=1}^{n_0} X_i^2 - n_0\bar{X}^2 + \sum_{i=1}^{n_1} X_i^2 - n_1\bar{X}^2.$$

I think the next step hinges on the fact that for the zero group $\sum X_i^2 = 0$ and for the 1 group each $X_i^2 = 1$ (so that $\sum X_i^2 = n_1$).

Any advice on what I am missing to move forward here? Or if anyone can point me to a complete proof I'd be really grateful.

Best Answer

This becomes easy when you reparameterize the problem.

Instead of using a slope and intercept, notice that when there are just two distinct values of the $x_i$ you can describe the fit by giving its value $\eta_0$ for $x=0$ and its value $\eta_1$ for $x=1$.

Figure

This example shows the data as red dots, the OLS fit as a dashed line, and summarizes the two groups with boxplots. Group $A$ is at the left and group $B$ at the right. The slope of the line is precisely the amount needed to go from the mean of group $A$, with $\eta_0$ near $10$, to the mean of group $B$, with $\eta_1$ near $13$.
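
To make the reparameterization explicit, here is a short worked equation. The symbol $b_0$ for the usual OLS intercept is introduced here for illustration; it does not appear in the original post. The fitted line takes the value $b_0$ at $x=0$ and $b_0+b_1$ at $x=1$, so

$$\eta_0 = b_0,\quad \eta_1 = b_0 + b_1 \quad\Longleftrightarrow\quad b_0 = \eta_0,\quad b_1 = \eta_1 - \eta_0.$$

This map between $(b_0, b_1)$ and $(\eta_0, \eta_1)$ is one-to-one, so minimizing the sum of squared residuals over $(\eta_0,\eta_1)$ gives the same fitted values as minimizing over $(b_0,b_1)$.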

Least squares requires you to choose values of these parameters that minimize the sum of squares of residuals. Since the value of $\eta_0$ affects the residuals only for group $A$ (where $x_i=0$) and $\eta_1$ affects the residuals only for group $B$ (where $x_i=1$), each will be estimated as the mean of its associated group. Because these means also happen to be the Maximum Likelihood estimates (as well as the OLS estimates), the ML estimate of the slope (which is also its OLS estimate) must be

$$b_1 = \frac{\hat\eta_1 - \hat\eta_0}{1-0} = \hat\eta_1 -\hat\eta_0,$$

which is just the difference in the group means. The OLS estimate of the error variance (which does differ from the ML estimate, so we cannot exploit ML at this point) is the sum of squared residuals divided by the degrees of freedom, which is $n-2$. Because the residuals are deviations from the two group means, this is precisely the pooled variance for the two-sample t-test. Consequently, $b_1/se(b_1)$ is exactly the same, and computed in exactly the same way, as the Student t statistic.
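
For readers who want the algebra spelled out, the following is a sketch of the steps. The symbols $\bar{Y}_0$, $\bar{Y}_1$ (group means) and $s_p^2$ (pooled variance) are introduced here for illustration, and the formula $se(b_1)^2 = s^2/\sum(X_i-\bar{X})^2$ is assumed from standard OLS theory rather than derived.

The sum of squared residuals separates by group, so each parameter is estimated by its group mean:

$$\text{SSR}(\eta_0,\eta_1) = \sum_{i:\,X_i=0}(Y_i-\eta_0)^2 + \sum_{i:\,X_i=1}(Y_i-\eta_1)^2 \quad\Longrightarrow\quad \hat\eta_0=\bar{Y}_0,\quad \hat\eta_1=\bar{Y}_1.$$

Dividing the minimized sum by $n-2 = n_0+n_1-2$ gives the pooled variance:

$$s^2 = \frac{\sum_{i:\,X_i=0}(Y_i-\bar{Y}_0)^2+\sum_{i:\,X_i=1}(Y_i-\bar{Y}_1)^2}{n_0+n_1-2} = s_p^2.$$

For the denominator asked about in the question, note that $X_i^2 = X_i$ when $X_i \in \{0,1\}$, so $\sum X_i^2 = n_1$ and $\bar{X} = n_1/n$:

$$\sum (X_i-\bar{X})^2 = \sum X_i^2 - n\bar{X}^2 = n_1 - \frac{n_1^2}{n} = \frac{n_1(n-n_1)}{n} = \frac{n_1 n_0}{n}.$$

Putting these together, and using $n = n_0 + n_1$,

$$se(b_1)^2 = \frac{s^2}{\sum (X_i-\bar{X})^2} = s_p^2\,\frac{n}{n_0 n_1} = s_p^2\left(\frac{1}{n_0}+\frac{1}{n_1}\right), \qquad \frac{b_1}{se(b_1)} = \frac{\bar{Y}_1-\bar{Y}_0}{s_p\sqrt{\frac{1}{n_0}+\frac{1}{n_1}}} = t.$$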

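The equivalence can also be checked numerically. The snippet below is a minimal sketch using NumPy; the function name `slope_t_and_pooled_t` and the simulated data are made up for illustration and are not part of the original answer. It computes the slope t statistic from the textbook OLS formulas and the pooled two-sample t statistic directly from the two groups.

```python
import numpy as np


def slope_t_and_pooled_t(x, y):
    """Return (OLS slope t statistic, pooled two-sample t statistic) for 0/1-coded x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size

    # OLS slope and intercept from the textbook formulas.
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * y) / sxx
    b0 = y.mean() - b1 * x.mean()

    # Residual variance with n - 2 degrees of freedom, then t = b1 / se(b1).
    resid = y - (b0 + b1 * x)
    s2 = np.sum(resid ** 2) / (n - 2)
    t_slope = b1 / np.sqrt(s2 / sxx)

    # Pooled two-sample t statistic computed directly from the two groups.
    y0, y1 = y[x == 0], y[x == 1]
    n0, n1 = y0.size, y1.size
    ss_within = np.sum((y0 - y0.mean()) ** 2) + np.sum((y1 - y1.mean()) ** 2)
    sp2 = ss_within / (n0 + n1 - 2)
    t_pooled = (y1.mean() - y0.mean()) / np.sqrt(sp2 * (1 / n0 + 1 / n1))
    return t_slope, t_pooled


# Simulated data (illustrative only): group A centered near 10, group B near 13.
rng = np.random.default_rng(0)
x = np.repeat([0, 1], [12, 8])
y = np.where(x == 0, 10.0, 13.0) + rng.normal(size=x.size)

t_slope, t_pooled = slope_t_and_pooled_t(x, y)
print(t_slope, t_pooled, np.isclose(t_slope, t_pooled))
```

The two statistics should agree up to floating-point rounding for any data set in which both groups are non-empty and $n > 2$, including unbalanced groups as in this example.
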