Fitting a straight line: Total Least Squares or Ordinary Least Squares

regression

I want to fit a straight line through a scatter plot of two time series to understand the influence that sea surface temperatures (x-axis) have on land temperatures over a particular region (y-axis). I have calculated the correlation coefficient, which isn't particularly strong (0.16), but I also want to fit a straight line through the data, and that is the part I'm not sure about. For TLS (Total Least Squares) I have used scipy.odr, and for OLS (Ordinary Least Squares) I have used numpy.polyfit with a polynomial of degree one (I am also open to using R if required).

The gradients of the fitted lines seem very different, so I figure this is important to work out. I am getting contradictory advice from colleagues and am hoping to settle this here. It would be helpful if answers could explain why one or the other method should be used. See the attached figure.

--EDIT--
I realise this may be a poor example, as the correlation is weak. Please focus on the fitted lines instead, as I will have cases with higher correlations.

[Fitting a straight line through points using two different methods]

Best Answer

It is very good that you explicitly state your goal, i.e. "I want ... to understand the influence sea surface temperatures (x-axis) have on land temperature over a particular region (y-axis)". Too often this aspect is ignored in these sorts of questions!

First, as always, it is important to remember that correlation does not imply causation.

Now, the two approaches to line fitting differ statistically in that OLS treats $x$ as "error free", while TLS (a.k.a. "errors in variables" linear regression) allows for uncertainty in both $x$ and $y$. (In orthogonal least squares, the two variables are treated symmetrically.)
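To make the distinction concrete, the two fits can be written as minimization problems. The intercept $a$, slope $b$, and sample index $i$ are notation introduced here for illustration, and the second expression assumes the orthogonal special case, in which the errors in $x$ and $y$ have equal variance:

$$
\text{OLS:}\quad \min_{a,b}\sum_i (y_i - a - b\,x_i)^2,
\qquad
\text{orthogonal:}\quad \min_{a,b}\sum_i \frac{(y_i - a - b\,x_i)^2}{1+b^2}.
$$

The first sums squared vertical distances to the line, the second squared perpendicular distances. Only the perpendicular distance is unchanged when the roles of $x$ and $y$ are swapped.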

The two approaches also differ in their goals: Orthogonal least squares is similar to PCA, and is essentially fitting a multivariate Gaussian joint distribution $p[x,y]$ to the data (in the 2D case, at least). Ordinary least squares is more oriented to fitting a set of conditional Gaussian distributions $p[y \vert x]$ to the data.
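A minimal sketch of the PCA connection, using only NumPy and synthetic data (the sample size, true slope of 0.2, and noise level are assumptions chosen to give a weak correlation, not the temperature data from the question). The orthogonal fit is taken as the direction of the leading eigenvector of the sample covariance matrix, which is a stand-in for what scipy.odr would compute when both variables are weighted equally:

```python
import numpy as np

# Synthetic, weakly correlated data (assumed values, for illustration only)
rng = np.random.default_rng(0)
n = 500
x = rng.normal(0.0, 1.0, n)
y = 0.2 * x + rng.normal(0.0, 1.0, n)

# OLS of y on x: minimizes vertical distances
ols_slope, ols_intercept = np.polyfit(x, y, 1)

# Orthogonal least squares: the line through the centroid along the
# first principal component (eigenvector with the largest eigenvalue)
cov = np.cov(x, y)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
v = eigvecs[:, -1]
tls_slope = v[1] / v[0]
tls_intercept = y.mean() - tls_slope * x.mean()

# OLS of x on y, re-expressed as a slope in the (x, y) plane
inv_slope = 1.0 / np.polyfit(y, x, 1)[0]

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")
print(f"OLS (y on x) slope:      {ols_slope:.3f}")
print(f"orthogonal slope:        {tls_slope:.3f}")
print(f"OLS (x on y) as dy/dx:   {inv_slope:.3f}")
```

With a weak correlation the orthogonal slope comes out much steeper than the OLS slope, which is the kind of gap described in the question. In magnitude it always lies between the two OLS slopes (y on x, and x on y inverted).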

Since your $x$ and $y$ variables have the same units (both are temperatures) and similar ranges, orthogonal least squares is certainly reasonable. It is difficult to tell from the plot (given the large size, low transparency, and high density of over-printed points), but the TLS line also appears to capture the data better.

A summary of the usefulness of the two approaches might be as follows:

  • If your goal is to constrain the distribution of $y$ given a precise value of $x$, then the OLS line is what you want. (For example, $R^2$ gives the fractional reduction in variance for $y|x$ vs. $y$.)
  • If your goal is to constrain the "independent components" of the 2D $(x,y)$ data, then TLS is better. For example, the first principal component may have a common cause, in terms of the system dynamics.
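The $R^2$ remark in the first bullet can be checked numerically. The sketch below uses synthetic data (all values are assumptions for illustration) and confirms that, for a straight-line OLS fit with an intercept, the fractional reduction in variance equals the squared correlation coefficient. For the correlation of 0.16 reported in the question, that would be $0.16^2 \approx 0.026$, i.e. knowing $x$ removes under 3% of the variance in $y$.

```python
import numpy as np

# Synthetic data with a weak linear relationship (assumed values)
rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 1000)
y = 0.2 * x + rng.normal(0.0, 1.0, 1000)

# OLS fit and its residuals
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# R^2 as the fractional reduction in variance of y once x is known
r2 = 1.0 - resid.var() / y.var()
r = np.corrcoef(x, y)[0, 1]

print(f"R^2 from residuals: {r2:.4f}")
print(f"r squared:          {r**2:.4f}")
```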

Given your stated goal, it appears that the OLS line ($p[y|x]$) is what you are probably after.

However, note that OLS assumes that the residual variance is independent of $x$ (i.e. $\sigma^2_{y|x}\neq f[x]$), a condition known by the colorful term "homoskedasticity". This is something you should check (e.g. by plotting the residuals against $x$). As noted above, your plot is difficult to judge by eye, but it appears that the spread in $y$ around the OLS line may vary with $x$. (So, as noted above, the TLS line may be a more reliable fit.)
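As a rough numerical complement to a residual plot, one could compare the spread of the OLS residuals across bins of $x$. The helper `binned_residual_std` below is hypothetical (not part of NumPy or SciPy), the choice of five equal-count bins is arbitrary, and both data sets are synthetic:

```python
import numpy as np

def binned_residual_std(x, y, n_bins=5):
    """Standard deviation of OLS residuals within equal-count bins of x.

    Hypothetical helper for illustration; roughly constant values across
    bins are consistent with homoskedastic residuals.
    """
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    order = np.argsort(x)  # sort residuals by increasing x
    groups = np.array_split(resid[order], n_bins)
    return np.array([g.std(ddof=1) for g in groups])

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, 2000)

# Constant noise level: homoskedastic by construction
y_homo = 0.5 * x + rng.normal(0.0, 1.0, x.size)
# Noise level growing with x: heteroskedastic by construction
y_hetero = 0.5 * x + rng.normal(0.0, 0.2 + 0.3 * x, x.size)

spread_homo = binned_residual_std(x, y_homo)
spread_hetero = binned_residual_std(x, y_hetero)
print("homoskedastic spread per bin:  ", np.round(spread_homo, 2))
print("heteroskedastic spread per bin:", np.round(spread_hetero, 2))
```

A clear trend in the per-bin spread, as in the second data set, suggests the constant-variance assumption does not hold. This is only an informal check, and a residual plot remains the more informative first look.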
