There are two questions. First, there is a purely mathematical question about the possibility of decomposing the GLS estimator into the OLS estimator plus a correction factor. Second, there is a question about what it means when OLS and GLS are the same. (I will use ' rather than T throughout to mean transpose).
Also, I would appreciate knowing about any errors you find in the arguments.
Question 1
Ordinary Least Squares (OLS) solves the following problem:
\begin{align}
\min_x\;\left(y-Hx\right)'\left(y-Hx\right)
\end{align}
leading to the solution:
\begin{align}
\hat{x}_{OLS}=\left(H'H\right)^{-1}H'y
\end{align}
Generalized Least Squares (GLS) solves the following problem:
\begin{align}
\min_x\;\left(y-Hx\right)'C^{-1}\left(y-Hx\right)
\end{align}
leading to the solution:
\begin{align}
\hat{x}_{GLS}=\left(H'C^{-1}H\right)^{-1}H'C^{-1}y
\end{align}
Now, make the substitution $C^{-1}=X+I$ in the GLS problem:
\begin{align}
\min_x\;&\left(y-Hx\right)'\left(X+I\right)\left(y-Hx\right)\\
\min_x\;&\left(y-Hx\right)'X\left(y-Hx\right) + \left(y-Hx\right)'\left(y-Hx\right)
\end{align}
The solution is still characterized by the first-order conditions, since we are assuming that $C$, and therefore $C^{-1}$, is positive definite:
\begin{align}
0=&2\left(H'XH\hat{x}_{GLS}-H'Xy\right) +2\left(H'H\hat{x}_{GLS}-H'y\right)\\
\hat{x}_{GLS}=&\left(H'H\right)^{-1}H'y+\left(H'H\right)^{-1}H'Xy
-\left(H'H\right)^{-1}H'XH\hat{x}_{GLS}\\
\hat{x}_{GLS}=& \hat{x}_{OLS} + \left(H'H\right)^{-1}H'Xy
-\left(H'H\right)^{-1}H'XH\hat{x}_{GLS}
\end{align}
From here, I can see two ways to get what you asked for in the question. First, we have a formula for the $\hat{x}_{GLS}$ on the right-hand side of the last expression, namely $\left(H'C^{-1}H\right)^{-1}H'C^{-1}y$. Substituting it in gives a closed-form solution for the GLS estimator, decomposed into an OLS part and a bunch of other stuff. The other stuff, obviously, goes away if $H'X=0$. To be clear, one possible answer to your first question is this:
\begin{align}
\hat{x}_{GLS}=& \hat{x}_{OLS} + \left(H'H\right)^{-1}H'Xy
-\left(H'H\right)^{-1}H'XH\left(H'C^{-1}H\right)^{-1}H'C^{-1}y\\
\hat{x}_{GLS}=& \hat{x}_{OLS} + \left(H'H\right)^{-1}H'X \left(I
-H\left(H'C^{-1}H\right)^{-1}H'C^{-1}\right)y
\end{align}
I can't say I get much out of this. That awful mess near the end multiplying $y$ is a projection matrix, but onto what?
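As a pure sanity check (not part of the argument), here is a small NumPy sketch with a made-up random $H$, $y$, and positive-definite $C$, confirming that this closed form reproduces the usual GLS estimator; all of the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
H = rng.normal(size=(n, k))
y = rng.normal(size=n)
A = rng.normal(size=(n, n))
Cinv = np.linalg.inv(A @ A.T + n * np.eye(n))   # a positive-definite C^{-1}
X = Cinv - np.eye(n)                            # the substitution C^{-1} = X + I

x_ols = np.linalg.solve(H.T @ H, H.T @ y)
x_gls = np.linalg.solve(H.T @ Cinv @ H, H.T @ Cinv @ y)

# the "awful mess" multiplying y:  M = I - H (H'C^{-1}H)^{-1} H'C^{-1}
M = np.eye(n) - H @ np.linalg.solve(H.T @ Cinv @ H, H.T @ Cinv)
x_decomp = x_ols + np.linalg.solve(H.T @ H, H.T @ X @ M @ y)

print(np.allclose(x_decomp, x_gls))        # True: the decomposition matches GLS
print(np.allclose(M @ M, M))               # True: M is idempotent
print(np.allclose(M @ y, y - H @ x_gls))   # True: M y is the GLS residual vector
```

In this example $My$ comes out equal to the GLS residual vector $y-H\hat{x}_{GLS}$, which suggests one way to read that last matrix: it is the oblique projection that maps $y$ to the GLS residuals.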
The other way to proceed is to go back to the last line before I paused to note that there were two options, and to continue like this:
\begin{align}
\left(I+\left(H'H\right)^{-1}H'XH\right)\hat{x}_{GLS}=& \hat{x}_{OLS} + \left(H'H\right)^{-1}H'Xy\\
\hat{x}_{GLS}=& \left(I+\left(H'H\right)^{-1}H'XH\right)^{-1}\left(\hat{x}_{OLS} + \left(H'H\right)^{-1}H'Xy\right)
\end{align}
Again, GLS is decomposed into an OLS part and another part, and the other part goes away if $H'X=0$. I still don't get much out of this. What this version says is that GLS is a matrix-weighted combination of OLS and of a linear regression of $Xy$ on $H$. I guess you could think of $Xy$ as $y$ suitably normalized, that is, after the "bad" part of the variance $C$ has been divided out of it.
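The same kind of numerical sanity check works here (again with made-up random inputs, variable names mine): solving the system implied by the left-hand side above reproduces GLS.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
H, y = rng.normal(size=(n, k)), rng.normal(size=n)
A = rng.normal(size=(n, n))
Cinv = np.linalg.inv(A @ A.T + n * np.eye(n))   # positive-definite C^{-1}
X = Cinv - np.eye(n)

x_ols = np.linalg.solve(H.T @ H, H.T @ y)
x_gls = np.linalg.solve(H.T @ Cinv @ H, H.T @ Cinv @ y)

HtH_inv = np.linalg.inv(H.T @ H)
lhs = np.eye(k) + HtH_inv @ H.T @ X @ H     # I + (H'H)^{-1} H'XH
rhs = x_ols + HtH_inv @ H.T @ X @ y         # x_OLS + (H'H)^{-1} H'Xy
print(np.allclose(np.linalg.solve(lhs, rhs), x_gls))   # True
```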
I should be careful and verify that the matrix I inverted in the last step is actually invertible:
\begin{align}
\left(I+\left(H'H\right)^{-1}H'XH\right) &= \left(H'H\right)^{-1}\left(H'H+H'XH\right)\\
&= \left(H'H\right)^{-1}H'\left(I+X\right)H\\
&= \left(H'H\right)^{-1}H'C^{-1}H
\end{align}
This is the product of two invertible matrices: $\left(H'H\right)^{-1}$ exists whenever OLS itself is defined, and $H'C^{-1}H$ is positive definite (hence invertible) because $C^{-1}$ is. So the inverse taken in the last step does exist.
Question 2
The question here is when GLS and OLS are the same, and what intuition we can form about the conditions under which that is true. I will only provide an answer here for a special assumption on the structure of $C$. The requirement is:
\begin{align}
\left(H'C^{-1}H\right)^{-1}H'C^{-1}Y = \left( H'H\right)^{-1}H'Y
\end{align}
To form our intuitions, let's assume that $C$ is diagonal, let's define $\overline{c}$ by $\frac{1}{\overline{c}}=\frac{1}{K}\sum_{i=1}^K \frac{1}{C_{ii}}$ (with $K$ the number of observations), and let's write:
\begin{align}
\left(H'C^{-1}H\right)^{-1}H'C^{-1}Y &=
\left(H'\overline{c}C^{-1}H\right)^{-1}H'\overline{c}C^{-1}Y\\
&=\left( H'H\right)^{-1}H'Y
\end{align}
One (sufficient) way for this equation to hold is for it to hold separately for each of the two factors:
\begin{alignat}{3}
\left(H'\overline{c}C^{-1}H\right)^{-1}
&=\left( H'H\right)^{-1} & \iff& & H'\left(\overline{c}C^{-1}-I\right)H&=0\\
H'\overline{c}C^{-1}Y&=H'Y & \iff& & H'\left(\overline{c}C^{-1}-I\right)Y&=0
\end{alignat}
Remembering that $C$, $C^{-1}$, and $I$ are all diagonal, and denoting by $H_i$ the $i$th row of $H$ written as a column vector:
\begin{alignat}{3}
H'\left(\overline{c}C^{-1}-I\right)H&=0 & \iff&
& \frac{1}{K} \sum_{i=1}^K H_iH_i'\left( \frac{\overline{c}}{C_{ii}}-1\right)=0\\
H'\left(\overline{c}C^{-1}-I\right)Y&=0 & \iff&
& \frac{1}{K} \sum_{i=1}^K H_iY_i\left( \frac{\overline{c}}{C_{ii}}-1\right)=0
\end{alignat}
What are those things on the right-hand side of the double-headed arrows? They are a kind of sample covariance. To see this, notice that the mean of $\frac{\overline{c}}{C_{ii}}$ across observations is 1, by the construction of $\overline{c}$. Finally, we are ready to say something intuitive. In this special case, OLS and GLS are the same if the inverse of the error variance is uncorrelated (across observations) with the products of the right-hand-side variables with each other and with the products of the right-hand-side variables with the left-hand-side variable. This is a very intuitive result.
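To spell the "sample covariance" reading out, write $w_i=\frac{\overline{c}}{C_{ii}}$, so that $\bar{w}=1$ by construction; then, for the first condition for example,
\begin{align}
\frac{1}{K}\sum_{i=1}^K H_iH_i'\left(\frac{\overline{c}}{C_{ii}}-1\right)
=\frac{1}{K}\sum_{i=1}^K \left(H_iH_i'-\overline{HH'}\right)\left(w_i-\bar{w}\right),
\qquad \overline{HH'}=\frac{1}{K}\sum_{i=1}^K H_iH_i',
\end{align}
which is exactly the matrix of sample covariances between the products $H_iH_i'$ and the weights $w_i$; the second condition is the same with $H_iY_i$ in place of $H_iH_i'$.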
In estimating the linear model, we only use the products of the RHS variables with each other and with the LHS variable, $(H'H)^{-1}H'y$. In GLS, we weight these products by the inverse of the variance of the errors. When does that re-weighting do nothing, on average? Why, when the weights are uncorrelated with the thing they are re-weighting! Yes? When is a weighted average the same as a simple average? When the weights are uncorrelated with the things you are averaging.
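Here is a toy numerical illustration of that last statement (everything in it is made up for the example): weights constructed to have mean one and exactly zero sample covariance with the values leave the average unchanged, while generic weights do not.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=200)        # the things being averaged
w = rng.gamma(2.0, size=200)    # generic positive weights

# remove the sample covariance between the weights and a, then rescale to mean 1
ac = a - a.mean()
w0 = w - (np.dot(w, ac) / np.dot(ac, ac)) * ac
w0 = w0 - w0.mean() + 1.0

print(np.isclose(np.sum(w0 * a) / np.sum(w0), a.mean()))  # True: uncorrelated weights
print(np.isclose(np.sum(w * a) / np.sum(w), a.mean()))    # generally False
```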
This insight, by the way, if I am remembering correctly, is due to White (1980), and perhaps Huber (1967) before him; I don't recall exactly.
Curves $y = a + b x + c x^2$ are parabolas that point straight up, so they cannot match tilted parabolas (think of a big satellite antenna), no matter how you choose $a$, $b$, $c$.
Edit 9 August: the 45° rotation in the original answer below is wrong, @whuber is right.
Consider noisy parabolas, all with $\bar{X} = \bar{Y} = 0$ and $s_X = s_Y$,
tilted at various angles: 0° right, 45°, 90° up ...
I see no direct way of finding the tilt. Brute force is klutzy, but may be good enough (a rough code sketch follows the outline):
XY -= mean
for angle in e.g. [0, 10, 20 .. 170]:
    rotate XY by angle
    scale: XY /= std
    TLS fit: Y = XB + Res
    fit = rms residual_i = |data_i - nearest point on the parabola|
take the best fit.
If more accuracy is needed, use a 1d optimizer such as golden search.
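For concreteness, here is roughly what that brute-force loop could look like in Python/NumPy. It is only a sketch of the outline above, not a faithful implementation: in particular it uses an ordinary least-squares parabola fit with vertical residuals (`np.polyfit`) as a stand-in for the TLS distance to the nearest point on the parabola, and the function name and the 10° grid are just placeholders.

```python
import numpy as np

def brute_force_tilt(XY, angles_deg=range(0, 180, 10)):
    """Try a grid of tilt angles; return the angle whose rotated, rescaled
    data is best fit by an upright parabola y = a + b x + c x^2."""
    XY = XY - XY.mean(axis=0)                  # centre the data
    best = (np.inf, None, None)
    for a in angles_deg:
        t = np.deg2rad(a)
        R = np.array([[np.cos(t), -np.sin(t)],
                      [np.sin(t),  np.cos(t)]])
        rot = XY @ R.T                         # rotate the point cloud by a degrees
        rot = rot / rot.std(axis=0)            # scale each coordinate by its sd
        x, y = rot[:, 0], rot[:, 1]
        coef = np.polyfit(x, y, 2)             # plain LS stand-in for the TLS fit
        rms = np.sqrt(np.mean((y - np.polyval(coef, x)) ** 2))
        if rms < best[0]:
            best = (rms, a, coef)
    return best   # (rms, tilt in degrees, [c, b, a] of the upright parabola)

# usage: rms, tilt, coef = brute_force_tilt(np.column_stack([x_data, y_data]))
```

A golden-section or similar 1d refinement around the best grid angle, as suggested above, would be the natural next step.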
A quite different approach would be to minimize a nonlinear function of $a$, $b$, $c$ and the tilt.
In short, the question of how to use TLS to directly fit a noisy tilted parabola
seems to me open, even in 2d.
A method that might work well enough for shallow parabolas:
- centre the data at [0 0] -- subtract the means
- scale X and Y the same -- divide by their standard deviations
- now the 45° line $y = x$ is a pretty good line fit; see methods-for-fitting-a-simple-measurement-error-model (or it might be $y = -x$)
- rotate the data 45° clockwise, so that a parabola now points up
- least-squares fit a parabola, by TLS (or ordinary? try it and see)
- reverse the centre / scale / rotate steps: rotate 45° counterclockwise, unscale, shift back.
Best Answer
Least squares in $y$ is often called ordinary least squares (OLS) because it was the first ever statistical procedure to be developed, circa 1800; see history. It is equivalent to minimizing the $L_2$ norm, $||Y-f(X)||_2$. Subsequently, weighted least squares, minimization of other norms (e.g., $L_1$), generalized least squares, M-estimation, bivariate minimization (e.g., Deming regression), non-parametric regression, maximum likelihood regression, regularization (e.g., Tikhonov, ridge), and other inverse-problem techniques and multiple other tools were developed. There is still controversy over who first applied it, Gauss or Legendre (see link). The term "ordinary" (implying in $y$) was obviously added to "least squares" only after so many alternative methods arose that the (still most) popular OLS needed to be differentiated from the plethora of other minimizations that had become available. Exactly when "ordinary" became attached to "least squares" would be hard to track down, since it happened when it became natural or obvious to do so.