Solved – Enforcing orthogonality of inputs for multiple linear regression

linear model, multiple regression, regression, self-study

I am studying the well-known book Elements of Statistical Learning. When multiple linear regression is described, it uses simple univariate regression as a building block, which makes sense to me. As far as I understand, it uses the orthogonality property of the input vectors to split the multivariate regression into simple, independent regressions; when the inputs are not orthogonal, they are transformed in such a way that what remains is orthogonal. By orthogonal vectors I mean two vectors whose dot product equals zero.

The book notes that

> Orthogonal inputs occur most often with balanced, designed experiments (where orthogonality is enforced), but almost never with observational data.

How can one enforce that? The only situation I can imagine is using binary 0/1 values for each possible nominal value. To be more concrete: one could have a nominal column sex with labels male and female. One can create two input columns, one called sex.male with value 1 when sex is male and 0 otherwise; the corresponding column sex.female would then have 1 if sex is female and 0 otherwise. These two numerical columns would be orthogonal. Is it possible to enforce orthogonality for continuous variables as well?
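For what it's worth, the dummy-coding claim is easy to check numerically; a minimal sketch in Python/NumPy with made-up data (the column names mirror the sex.male/sex.female example):

```python
import numpy as np

# Made-up nominal data
sex = np.array(["male", "female", "female", "male", "female"])

# Dummy-code it as two 0/1 indicator columns
sex_male = (sex == "male").astype(float)
sex_female = (sex == "female").astype(float)

# The two indicators are never 1 on the same row,
# so their dot product is zero: the columns are orthogonal
print(sex_male @ sex_female)  # 0.0
```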

Best Answer

There are plenty of examples of orthogonal designs for continuous predictors in the experimental design literature. A simple one is the design matrix (using centred predictors)

$$\boldsymbol{X}=(\boldsymbol{1},\boldsymbol{x}_1,\boldsymbol{x}_2)=\left(\begin{matrix} 1 & -1 & -1\\ 1 & -1 & 0\\ 1 & -1 & 1\\ 1 & 0 & -1\\ 1 & 0 & 0\\ 1 & 0 & 1\\ 1 & 1 & -1\\ 1 & 1 & 0\\ 1 & 1 & 1\\ \end{matrix}\right)$$

for the linear regression $$y_i=\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} +\varepsilon_i$$
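Because the columns of this design are mutually orthogonal, the multiple-regression coefficients coincide with those of three separate simple regressions, $\hat\beta_j = \langle \boldsymbol{x}_j, \boldsymbol{y}\rangle / \langle \boldsymbol{x}_j, \boldsymbol{x}_j\rangle$, which is exactly the decomposition the question refers to. A minimal sketch in Python/NumPy (the response is simulated and the true coefficients are chosen arbitrarily for illustration):

```python
import numpy as np

# The 9-run design above: a 3x3 factorial in two centred predictors
x1 = np.repeat([-1.0, 0.0, 1.0], 3)
x2 = np.tile([-1.0, 0.0, 1.0], 3)
X = np.column_stack([np.ones(9), x1, x2])

# Simulate a response from the model (arbitrary true coefficients)
rng = np.random.default_rng(0)
y = 2.0 + 0.5 * x1 - 1.5 * x2 + rng.normal(scale=0.1, size=9)

# Full multiple regression ...
beta_multi, *_ = np.linalg.lstsq(X, y, rcond=None)

# ... agrees with three independent univariate regressions,
# because the columns of X are mutually orthogonal
beta_simple = np.array([xj @ y / (xj @ xj) for xj in X.T])

print(np.allclose(beta_multi, beta_simple))  # True
```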

The diagonal variance–covariance matrix for the parameter estimates

$$\operatorname{Var} \boldsymbol{\hat\beta}= (\boldsymbol{X}^\mathrm{T}\boldsymbol{X})^{-1}\sigma^2=\left(\begin{matrix} \tfrac{1}{9} & 0 & 0\\ 0 & \tfrac{1}{6} & 0\\ 0 & 0 & \tfrac{1}{6}\\ \end{matrix}\right)\sigma^2$$

where $\sigma^2$ is the error variance, shows that you have uncorrelated estimators for $\beta_1$ and $\beta_2$.
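The diagonal form is easy to verify numerically; a short check with NumPy:

```python
import numpy as np

x1 = np.repeat([-1.0, 0.0, 1.0], 3)
x2 = np.tile([-1.0, 0.0, 1.0], 3)
X = np.column_stack([np.ones(9), x1, x2])

# X^T X = diag(9, 6, 6): all off-diagonal entries vanish
print(X.T @ X)

# Its inverse is diag(1/9, 1/6, 1/6), matching the matrix above
print(np.linalg.inv(X.T @ X))
```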