[Math] Does SSTR (sum of squares for treatments) = SSR (regression sum of squares)?

statistics

I'm taking my first course in statistics, and regression and ANOVA seem to have something in common:

In ANOVA I'm told that SST (total sum of squares) = SSE (error sum of squares) + SSTR (sum of squares for treatments),

while in regression, SST (total sum of squares) = SSE (error sum of squares) + SSR (regression sum of squares).

Does that mean [after algebraic manipulation of the above two equations] that SSR = SSTR?

I've looked up the formulas for both SSR and SSTR, and they are as follows:

SSR = $\sum\nolimits_{i=1}^{N}(\hat{Y}_{i}-\bar{Y})^{2}$

SSTR = $\sum\nolimits_{j=1}^{q}n_{j}(\bar{Y}_{j\cdot}-\bar{Y})^{2}$

Those two formulas look different. Does that mean I should not mix regression with ANOVA?

Best Answer

As Jonathan says, you are indeed correct; well spotted. If you have a 1-factor ANOVA with, say, $q$ levels, then your observations are indexed by $j$ and $k$: $Y_{jk}$, $j=1,\dots,q$, $k=1,\dots,n_{j}$. The ANOVA model is $Y_{jk}=\mu_{j}+\epsilon_{jk}$, where $\mu_{j}$ is the unknown true mean of factor level $j$, which you want to estimate. This model can be written in "regression" notation by indexing the observations with just one index, say $i$: $Y_{i}=\sum\nolimits_{j=1}^{q}x_{ij}\beta_{j}+\epsilon_{i}$, $i=1,\dots,N$, where for each $i$, $x_{ij}=1$ for exactly one $j$ (the level that observation $i$ belongs to) and zero otherwise. Among your $N$ regression observations there are $n_{j}$ observations with $x_{ij}=1$, so that $\sum\nolimits_{j=1}^{q}n_{j}=N$. Letting $\hat{Y}_{i}=\sum\nolimits_{j=1}^{q}x_{ij}\hat{\beta}_{j}$ we see that

$SSR=\sum\nolimits_{i=1}^{N}(\hat{Y}_{i}-\bar{Y})^{2}=\sum\nolimits_{i=1}^{N}(\sum\nolimits_{j=1}^{q}x_{ij}\hat{\beta}_{j}-\bar{Y})^{2}=\sum\nolimits_{j=1}^{q}n_{j}(\hat{\beta}_{j}-\bar{Y})^{2}$.
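The re-indexing from $Y_{jk}$ to $Y_{i}$ with 0/1 indicators can be made concrete with a small sketch. The data values and the helper name `to_regression_form` are made up for illustration; the only point is that each row of the indicator matrix contains a single 1, and that column $j$ sums to $n_{j}$.

```python
def to_regression_form(groups):
    """Flatten grouped observations Y_jk into a response list y and a
    0/1 indicator matrix X with one column per factor level."""
    q = len(groups)
    y, X = [], []
    for j, observations in enumerate(groups):
        for value in observations:
            y.append(value)
            # x_ij = 1 for the level this observation belongs to, else 0
            X.append([1 if column == j else 0 for column in range(q)])
    return y, X


# Hypothetical data: q = 3 levels with n_j = 3, 2 and 4 observations.
groups = [[4.0, 5.0, 6.0], [7.0, 9.0], [1.0, 2.0, 3.0, 6.0]]
y, X = to_regression_form(groups)

# Column sums of X recover the group sizes n_j, and they add up to N.
column_counts = [sum(row[j] for row in X) for j in range(len(groups))]
print("n_j per level:", column_counts)
print("N:", len(y))
```

Because every row has exactly one nonzero entry, the fitted value for observation $i$ is simply the $\hat{\beta}_{j}$ of its own level, which is why the sum over $i$ collapses into a sum over $j$ weighted by $n_{j}$.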

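Now, because each $x_{ij}$ is zero or one, the least squares criterion splits into a separate sum of squares for each level, $\sum\nolimits_{i:x_{ij}=1}(Y_{i}-\beta_{j})^{2}$, and each of these is minimised by the sample mean of that level. Hence $\hat{\beta}_{j}=\bar{Y}_{j}$ (I mean this to denote the average of the $Y$'s where $x_{ij}=1$), and thus $\sum\nolimits_{i=1}^{N}(\hat{Y}_{i}-\bar{Y})^{2}=\sum\nolimits_{j=1}^{q}n_{j}(\bar{Y}_{j}-\bar{Y})^{2}$. In ANOVA notation we have $\bar{Y}_{j}=\bar{Y}_{j\cdot}$, and so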

$SSR=\sum\nolimits_{i=1}^{N}(\hat{Y}_{i}-\bar{Y})^{2}=\sum\nolimits_{j=1}^{q}n_{j}(\bar{Y}_{j\cdot}-\bar{Y})^{2}=SSTR$.
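The identity can also be checked numerically. The following is a minimal sketch, not a definitive implementation: the data are made up, all function names are hypothetical, and the least squares fit is done by solving the normal equations $(X^{T}X)\beta=X^{T}y$ with a small hand-written elimination routine so that only the Python standard library is needed.

```python
import math


def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def fit_indicator_regression(groups):
    """Least squares fit of Y_i = sum_j x_ij * beta_j + eps_i (0/1 indicators)."""
    q = len(groups)
    X, y = [], []
    for j, observations in enumerate(groups):
        for value in observations:
            X.append([1.0 if c == j else 0.0 for c in range(q)])
            y.append(float(value))
    # Normal equations: (X'X) beta = X'y
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(q)] for a in range(q)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(q)]
    beta_hat = solve_linear_system(XtX, Xty)
    fitted = [sum(x * b for x, b in zip(row, beta_hat)) for row in X]
    return beta_hat, fitted, y


def regression_ss(fitted, y):
    """SSR: squared deviations of fitted values from the grand mean."""
    grand_mean = sum(y) / len(y)
    return sum((f - grand_mean) ** 2 for f in fitted)


def treatment_ss(groups):
    """SSTR: n_j-weighted squared deviations of group means from the grand mean."""
    all_values = [v for observations in groups for v in observations]
    grand_mean = sum(all_values) / len(all_values)
    return sum(
        len(obs) * (sum(obs) / len(obs) - grand_mean) ** 2 for obs in groups
    )


# Hypothetical data: three factor levels with unequal group sizes.
groups = [[4.0, 5.0, 6.0], [7.0, 9.0], [1.0, 2.0, 3.0, 6.0]]
beta_hat, fitted, y = fit_indicator_regression(groups)
group_means = [sum(obs) / len(obs) for obs in groups]
ssr = regression_ss(fitted, y)
sstr = treatment_ss(groups)

print("beta_hat equals group means:",
      all(math.isclose(b, m) for b, m in zip(beta_hat, group_means)))
print("SSR equals SSTR:", math.isclose(ssr, sstr))
```

The regression is fitted without an intercept, matching the model $Y_{i}=\sum\nolimits_{j=1}^{q}x_{ij}\beta_{j}+\epsilon_{i}$ above; with an intercept and $q-1$ indicators the coefficients would differ, but the fitted values, and therefore SSR, would be the same.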

Basically, ANOVA is just a restricted form of regression, the restriction being that the explanatory variables are factor variables rather than continuous ones. I find it much easier to learn about regression first, and then to think of ANOVAs in this way. This is because all the theory of regression carries over to ANOVA, but the theory about the sums of squares of ANOVAs only applies to these specific regression models, and not to a general one (where both continuous and factor variables are present). If you have the time, it is worth learning about regression as well as ANOVAs. The theory of ANOVAs gets you thinking from a designed experiment viewpoint (randomised controlled trials), whilst regression theory is more general, since it is really about already having your data (not necessarily from a designed experiment) and wanting to analyse it.