Solved – Showing that the OLS estimator is scale equivariant


I don't have a formal definition of scale equivariance, but here's what An Introduction to Statistical Learning says about it on p. 217:

The standard least squares coefficients… are scale equivariant: multiplying $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimates by a factor of $1/c$.

For simplicity, let's assume the general linear model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon$, where $\mathbf{y} \in \mathbb{R}^N$, $\mathbf{X}$ is an $N \times (p+1)$ matrix (where $p+1 < N$) with all entries in $\mathbb{R}$, $\boldsymbol\beta \in \mathbb{R}^{p+1}$, and $\boldsymbol\epsilon$ is an $N$-dimensional vector of real-valued random variables with $\mathbb{E}[\boldsymbol\epsilon] = \mathbf{0}_{N \times 1}$.

From OLS estimation, we know that if $\mathbf{X}$ has full (column) rank,
$$\hat{\boldsymbol\beta}_{\mathbf{X}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\text{.}$$
Suppose we multiplied a column of $\mathbf{X}$, say $\mathbf{x}_k$ for some $k \in \{1, 2, \dots, p+1\}$, by a constant $c \neq 0$. This is equivalent to forming the product
\begin{equation}
\mathbf{X}\underbrace{\begin{bmatrix}
1 & \\
& 1 \\
& & \ddots \\
& & & 1 \\
& & & & c\\
& & & & & 1 \\
& & & & & &\ddots \\
& & & & & & & 1
\end{bmatrix}}_{\mathbf{S}} =
\begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & c\mathbf{x}_{k} & \cdots & \mathbf{x}_{p+1}\end{bmatrix} \equiv \tilde{\mathbf{X}}
\end{equation}
where all other entries of the matrix $\mathbf{S}$ above are $0$, and $c$ is in the $k$th entry of the diagonal of $\mathbf{S}$. Then, $\tilde{\mathbf X}$ has full (column) rank as well, and the resulting OLS estimator using $\tilde{\mathbf X}$ as the new design matrix is
$$\hat{\boldsymbol\beta}_{\tilde{\mathbf{X}}} = \left(\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}\right)^{-1}\tilde{\mathbf{X}}^{T}\mathbf{y}\text{.}$$
After some work, one can show that
$$\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} = \begin{bmatrix}
\mathbf{x}_1^{T}\mathbf{x}_1 & \mathbf{x}_1^{T}\mathbf{x}_2 & \cdots & c\mathbf{x}_1^{T}\mathbf{x}_k & \cdots & \mathbf{x}_1^{T}\mathbf{x}_{p+1} \\
\mathbf{x}_2^{T}\mathbf{x}_1 & \mathbf{x}_2^{T}\mathbf{x}_2 & \cdots & c\mathbf{x}_2^{T}\mathbf{x}_k & \cdots & \mathbf{x}_2^{T}\mathbf{x}_{p+1} \\
\vdots & \vdots & \ddots & \vdots & \cdots & \vdots \\
c\mathbf{x}_k^{T}\mathbf{x}_1 & c\mathbf{x}_k^{T}\mathbf{x}_2 & \cdots & c^2\mathbf{x}_k^{T}\mathbf{x}_k & \cdots & c\mathbf{x}_k^{T}\mathbf{x}_{p+1} \\
\vdots & \vdots & \vdots & \vdots & \cdots & \vdots \\
\mathbf{x}_{p+1}^{T}\mathbf{x}_1 & \mathbf{x}_{p+1}^{T}\mathbf{x}_2 & \cdots & c\mathbf{x}_{p+1}^{T}\mathbf{x}_{k} & \cdots & \mathbf{x}_{p+1}^{T}\mathbf{x}_{p+1} \\
\end{bmatrix}$$
and
$$\tilde{\mathbf{X}}^{T}\mathbf{y} = \begin{bmatrix}
\mathbf{x}_1^{T}\mathbf{y} \\
\mathbf{x}_2^{T}\mathbf{y} \\
\vdots \\
c\mathbf{x}_k^{T}\mathbf{y} \\
\vdots \\
\mathbf{x}_{p+1}^{T}\mathbf{y}
\end{bmatrix}$$
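(In compact form, since $\mathbf{S}$ is symmetric, these two computations say that $\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} = \mathbf{S}\mathbf{X}^{T}\mathbf{X}\mathbf{S}$ and $\tilde{\mathbf{X}}^{T}\mathbf{y} = \mathbf{S}\mathbf{X}^{T}\mathbf{y}$.)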
How do I go from here to show the claim quoted above (i.e., that $\hat{\beta}_{\tilde{\mathbf{X}},k} = \dfrac{1}{c}\hat{\beta}_{\mathbf{X},k}$, with the remaining coefficients unchanged)? It's not clear to me how to compute $(\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}})^{-1}$.

Best Answer

Since the assertion in the quotation is a collection of statements about rescaling the columns of $X$, you might as well prove them all at once. Indeed, it takes no more work to prove a generalization of the assertion:

When $X$ is right-multiplied by an invertible matrix $A$, then the new coefficient estimate $\hat\beta_A$ is equal to $\hat \beta$ left-multiplied by $A^{-1}$.

The only algebraic facts you need are the easily proven, well-known ones that $(AB)^\prime=B^\prime A^\prime$ for any conformable matrices $A$ and $B$ and that $(AB)^{-1}=B^{-1}A^{-1}$ for invertible matrices $A$ and $B$. (A subtler version of the latter is needed when working with generalized inverses: for invertible $A$ and $B$ and any $X$, $(AXB)^{-} = B^{-1}X^{-}A^{-1}$.)


Proof by algebra: $$\hat\beta_A = \left((XA)^\prime (XA)\right)^{-}(XA)^\prime y = A^{-1}(X^\prime X)^{-} (A^\prime)^{-1}A^\prime X^\prime y = A^{-1}(X^\prime X)^{-}X^\prime y = A^{-1}\hat \beta,$$

QED. (In order for this proof to be fully general, the $^-$ superscript refers to a generalized inverse.)
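If it helps to see the identity numerically, here is a minimal NumPy sketch (the simulated data, the sizes, and the use of a least-squares solver are my choices for illustration, not part of the proof); it checks $\hat\beta_A = A^{-1}\hat\beta$ for a random invertible $A$, using a full-rank design so that ordinary inverses suffice:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 4                          # illustrative sizes (my choice)
X = rng.normal(size=(n, p))           # full-column-rank design, almost surely
y = rng.normal(size=n)
A = rng.normal(size=(p, p))           # random reparameterization; invertible almost surely

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]        # OLS fit on X
beta_hat_A = np.linalg.lstsq(X @ A, y, rcond=None)[0]  # OLS fit on XA

# Claim from the proof: beta_hat_A = A^{-1} beta_hat
print(np.allclose(beta_hat_A, np.linalg.solve(A, beta_hat)))  # expect True
```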


Proof by geometry:

Given bases $E_p$ and $E_n$ of $\mathbb{R}^p$ and $\mathbb{R}^n$, respectively, $X$ represents a linear transformation from $\mathbb{R}^p$ to $\mathbb{R}^n$. Right-multiplication of $X$ by $A$ can be considered as leaving this transformation fixed but changing $E_p$ to $AE_p$ (that is, to the columns of $A$). Under that change of basis, the representation of any vector $\hat\beta\in\mathbb{R}^p$ must change via left-multiplication by $A^{-1}$, QED.

(This proof works, unmodified, even when $X^\prime X$ is not invertible.)


The quotation specifically refers to the case of diagonal matrices $A$ with $A_{ii}=1$ for $i\ne j$ and $A_{jj}=c$.
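Spelling out that special case: the inverse of such an $A$ is again diagonal, with $1/c$ in the $j$th position and $1$ elsewhere, so
$$\hat\beta_A = A^{-1}\hat\beta = \left(\hat\beta_1, \dots, \hat\beta_{j-1}, \tfrac{1}{c}\hat\beta_j, \hat\beta_{j+1}, \dots, \hat\beta_p\right)^\prime.$$
Only the $j$th coefficient changes, and it is scaled by $1/c$, which is exactly the claim in the quotation.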


Connection with least squares

The objective here is to use first principles to obtain the result, with the principle being that of least squares: estimating coefficients that minimize the sum of squares of residuals.

Again, proving a (huge) generalization is no more difficult and is rather revealing. Suppose $\phi:V^p\to W^n$ is any map (linear or not) of real vector spaces and suppose $Q$ is any real-valued function on $W^n$. Let $U\subset V^p$ be the (possibly empty) set of points $v$ for which $Q(\phi(v))$ is minimized.

Result: $U$, which is determined solely by $Q$ and $\phi$, does not depend on any choice of basis $E_p$ used to represent vectors in $V^p$.

Proof: QED.

There's nothing to prove!

Application of the result: Let $F$ be a positive semidefinite quadratic form on $\mathbb{R}^n$, let $y\in\mathbb{R}^n$, and suppose $\phi$ is a linear map represented by $X$ when bases of $V^p=\mathbb{R}^p$ and $W^n=\mathbb{R}^n$ are chosen. Define $Q(x)=F(y-x)$. Choose a basis of $\mathbb{R}^p$ and suppose $\hat\beta$ is the representation of some $v\in U$ in that basis. This is least squares: $x=X\hat\beta$ minimizes the squared distance $F(y-x)$. Because $X$ is a linear map, changing the basis of $\mathbb{R}^p$ corresponds to right-multiplying $X$ by some invertible matrix $A$. That will left-multiply $\hat\beta$ by $A^{-1}$, QED.
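To make the geometric point concrete, here is a small numerical sketch (again with simulated data and sizes of my own choosing, and with NumPy's least-squares solver standing in for the abstract minimization): the fitted values and the attained minimum do not depend on the parameterization of the column space; only the coordinates of the minimizer do.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 30, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
A = rng.normal(size=(p, p))           # invertible change of basis, almost surely

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
beta_hat_A = np.linalg.lstsq(X @ A, y, rcond=None)[0]

# The fitted values -- the point of the column space closest to y --
# do not depend on the parameterization:
print(np.allclose(X @ beta_hat, (X @ A) @ beta_hat_A))   # expect True

# Neither does the attained minimum (the residual sum of squares):
rss = np.sum((y - X @ beta_hat) ** 2)
rss_A = np.sum((y - X @ A @ beta_hat_A) ** 2)
print(np.isclose(rss, rss_A))                             # expect True
```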
