Solved – Total least squares curve fit problem

curve fittingleast squaresMATLABregression

I am trying to fit a quadratic curve across a scatter plot of two variables. Since both variables are noisy I cannot use an ordinary least square regression (OLS) and I would like to have a symmetrical answer regardless which one I choose to be the independent variable.

Total (orthogonal) least squares regression seem to be the answer.

Unfortunately it seems that the two implementations that I have tried to use so far give me very bad fits.

In the figure below the blue line is the OLS fit, which obviously could be improved. The red is the TLS fit using the matlab/octave code below which seems to be the standard approach using single-value decompositions (SVD). I copied it from the corresponding Wikipedia article here.

I can see how it is trying to fit the upper-right denser cloud of points to the (very large) expense of the long tail. I have tested the code with simulated data and it works so is does not seem to be a bug.

I have also tried this other matlab package that implements a different iterative approach but unless I give it very tight bounds to the model parameters also gives me
"crazy" fits.

I want to be able to perform TLS between many variables with many bootstrapping repeats to obtain confidence intervals and I cannot afford to come out with bounds for all those repeats eyeballing each graph; that would also tamper with the confidence interval estimation I guess.

I reckon the problem is that the data is a bit too noisy but I can do little about it. You can find the data here.

Any ideas how to go around this issue? Perhaps some know a more robust TLS method with an easy to use implementation available? There seem to be many academic articles but little runnable code.

Many thanks in advance!
V.

enter image description here

function B = tls(xdata,ydata)
% xdata = nx3 matrix [x.^2 x ones(size(x))]
% ydata = nx1 matrix 
m       = length(ydata);       %number of x,y data pairs
X       = [xdata];
Y       = ydata;
n       = size(X,2);          % n is the width of X (X is m by n)
Z       = [X Y];              % Z is X augmented with Y.
[U S V] = svd(Z,0);           % find the SVD of Z.
VXY     = V(1:n,1+n:end);     % Take the block of V consisting of the first n rows and the n+1 to last column
VYY     = V(1+n:end,1+n:end); % Take the bottom-right block of V.
B       = -VXY/VYY;
end

Best Answer

Curves $y = a + b x + c x^2$ are parabolas that point straight up, so cannot match tilted parabolas (think of a big satellite antenna) no matter how you find $a\ b\ c$.


Edit 9 August: the 45° rotation in the original answer below is wrong, @whuber is right.
Consider noisy parabolas, all with $\bar{X} = \bar{Y} = 0$ and $s_X = s_Y$, tilted at various angles: 0° right, 45°, 90° up ... I see no direct way of finding the tilt. Brute force is klutzy, but may be good enough:

XY -= mean
for angle in e.g. [0 10 20 .. 170]:
    rotate XY by angle
    scale, XY /= std
    TLS: Y = XB + Res
    fit = rms residual_i = |data_i - nearest point on the parabola|
take the best fit.

If more accuracy is needed, use a 1d optimizer such as golden search.
A quite different approach would be to minimize a nonlinear function of a b c and tilt.

In short, the question of how to use TLS to directly fit a noisy tilted parabola seems to me open, even in 2d.


A method that might work well enough for shallow parabolas:

  1. centre the data at [0 0] -- subtract the means
  2. scale X and Y the same -- divide by their sd s
  3. now the 45° line $y = x$ is pretty good line fit; see methods-for-fitting-a-simple-measurement-error-model (or it might be $y = - x$). Rotate the data 45° clockwise, so that a parabola now points up
  4. least-squares fit a parabola, by TLS (or Ordinary ? try-it-and-see).
  5. reverse steps 1 - 3: rotate 45° counterclockwise, unscale, shift back.