Solved – Total least squares curve fit problem

curve fittingleast squaresMATLABregression

I am trying to fit a quadratic curve across a scatter plot of two variables. Since both variables are noisy I cannot use an ordinary least square regression (OLS) and I would like to have a symmetrical answer regardless which one I choose to be the independent variable.

Total (orthogonal) least squares regression seem to be the answer.

Unfortunately it seems that the two implementations that I have tried to use so far give me very bad fits.

In the figure below the blue line is the OLS fit, which obviously could be improved. The red is the TLS fit using the matlab/octave code below which seems to be the standard approach using single-value decompositions (SVD). I copied it from the corresponding Wikipedia article here.

I can see how it is trying to fit the upper-right denser cloud of points to the (very large) expense of the long tail. I have tested the code with simulated data and it works so is does not seem to be a bug.

I have also tried this other matlab package that implements a different iterative approach but unless I give it very tight bounds to the model parameters also gives me
"crazy" fits.

I want to be able to perform TLS between many variables with many bootstrapping repeats to obtain confidence intervals and I cannot afford to come out with bounds for all those repeats eyeballing each graph; that would also tamper with the confidence interval estimation I guess.

I reckon the problem is that the data is a bit too noisy but I can do little about it. You can find the data here.

Any ideas how to go around this issue? Perhaps some know a more robust TLS method with an easy to use implementation available? There seem to be many academic articles but little runnable code.

Many thanks in advance!
V.

enter image description here

function B = tls(xdata,ydata)
% xdata = nx3 matrix [x.^2 x ones(size(x))]
% ydata = nx1 matrix 
m       = length(ydata);       %number of x,y data pairs
X       = [xdata];
Y       = ydata;
n       = size(X,2);          % n is the width of X (X is m by n)
Z       = [X Y];              % Z is X augmented with Y.
[U S V] = svd(Z,0);           % find the SVD of Z.
VXY     = V(1:n,1+n:end);     % Take the block of V consisting of the first n rows and the n+1 to last column
VYY     = V(1+n:end,1+n:end); % Take the bottom-right block of V.
B       = -VXY/VYY;
end

Best Answer

Curves $y = a + b x + c x^2$ are parabolas that point straight up, so cannot match tilted parabolas (think of a big satellite antenna) no matter how you find $a\ b\ c$.

Edit 9 August: the 45° rotation in the original answer below is wrong, @whuber is right.
Consider noisy parabolas, all with $\bar{X} = \bar{Y} = 0$ and $s_X = s_Y$, tilted at various angles: 0° right, 45°, 90° up ... I see no direct way of finding the tilt. Brute force is klutzy, but may be good enough:

XY -= mean
for angle in e.g. [0 10 20 .. 170]:
    rotate XY by angle
    scale, XY /= std
    TLS: Y = XB + Res
    fit = rms residual_i = |data_i - nearest point on the parabola|
take the best fit.

If more accuracy is needed, use a 1d optimizer such as golden search.
A quite different approach would be to minimize a nonlinear function of a b c and tilt.

In short, the question of how to use TLS to directly fit a noisy tilted parabola seems to me open, even in 2d.

A method that might work well enough for shallow parabolas:

centre the data at [0 0] -- subtract the means
scale X and Y the same -- divide by their sd s
now the 45° line $y = x$ is pretty good line fit; see methods-for-fitting-a-simple-measurement-error-model (or it might be $y = - x$). Rotate the data 45° clockwise, so that a parabola now points up
least-squares fit a parabola, by TLS (or Ordinary ? try-it-and-see).
reverse steps 1 - 3: rotate 45° counterclockwise, unscale, shift back.

Related Solutions

Solved – Strange outcome when performing nonlinear least squares fit to a power law

It would be fairly unusual to find power law data for which the variance is actually constant across $x$ values. However, let's take that constant variance as given and proceed with the analysis as a least squares problem.

The easy way to fit this particular model is actually to use GLMs, not NLS! No need for starting values.

I don't have Matlab on this machine, but in R:

x <- seq(4,6.5,0.5)
y <- 1.0e-04 * c(0.011929047861902, 0.026053683604530, 0.057223759612162,
    0.117413370572612, 0.242357468772138, 0.462327165034928)

 powfit <- glm(y~log(x),family=gaussian(link=log))
 summary(powfit)

[snip]

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -25.05254    0.16275 -153.93 1.07e-08 ***
log(x)        8.05095    0.08824   91.24 8.65e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 6.861052e-14)

    Null deviance: 1.5012e-09  on 5  degrees of freedom
Residual deviance: 2.2273e-13  on 4  degrees of freedom
AIC: -162.52

Number of Fisher Scoring iterations: 1

> plot(x,y,log="xy")
> lines(x,fitted(powfit))

power fit

I highly recommend using Gaussian GLMs with the relevant link function whenever your NLS model can be cast in that form (it can for exponential and power functions, for example).

As to why you're having trouble with it in Matlab, I'm not sure - could it be to do with your convergence criteria? There's no issue with the data at all, but I suggest you start it off at the linear regression coefficients:

> lm(log(y)~log(x))

Call:
lm(formula = log(y) ~ log(x))

Coefficients:
(Intercept)       log(x)  
    -24.203        7.568

So start your parameters at $\exp(-24.2)$ and $7.57$ and see how that works. Or even better, do it the easy way and you can forget about starting values altogether.

Here's my results in R using nls, first, starting with $b$ at the value from the linear log-log fit:

nls(y~a*x^b,start=list(a=exp(-24.2),b=7.57))
Nonlinear regression model
  model: y ~ a * x^b
   data: parent.frame()
        a         b 
1.290e-11 8.063e+00 
 residual sum-of-squares: 2.215e-13

Number of iterations to convergence: 10 
Achieved convergence tolerance: 1.237e-07

And then shifting to starting at b=10:

nls(y~a*x^b,start=list(a=exp(-24.2),b=10))
Nonlinear regression model
  model: y ~ a * x^b
   data: parent.frame()
        a         b 
1.290e-11 8.063e+00 
 residual sum-of-squares: 2.215e-13

Number of iterations to convergence: 19 
Achieved convergence tolerance: 9.267e-08

As you see, they're very close.

The GLM didn't quite match the estimates of the parameters - though a stricter convergence criterion should improve that (glm is only taking one step before concluding it's converged above). It doesn't make much difference to the fit, though. With a bit of fiddling, it's possible to find a closer fit with the GLM (better than the nls), but you start getting down to very tiny deviance (around $10^{-14}$), and I worry if we're really just shifting around numerical error down that far.

Solved – How to choose initial values for nonlinear least squares fit

If there was a strategy that was both good and general -- one that always worked - it would already be implemented in every nonlinear least squares program and starting values would be a non-issue.

For many specific problems or families of problems, some pretty good approaches to starting values exist; some packages come with good start value calculations for specific nonlinear models or with more general approaches that often work but may have to be helped out with more specific functions or direct input of start values.

Exploring the space is necessary in some situations but I think your situation is likely to be such that more specific strategies will likely be worthwhile - but to design a good one pretty much requires a lot of domain knowledge we're unlikely to possess.

For your particular problem, likely good strategies can be designed, but it's not a trivial process; the more information you have about the likely size and extent of the peak (the typical parameter values and typical $x$'s would give some idea), the more can be done to design a good starting value choice.

What are the typical ranges of $y$'s and $x$'s you get? What do the average results look like? What are the unusual cases? What parameter values do you know to be possible or impossible?

One example - does the Gaussian part necessarily generate a turning point (a peak or trough)? Or is it sometimes so small relative to the line that it doesn't? Is $A$ always positive? etc.

Some sample data would help - typical cases and hard ones, if you're able.

Edit: Here's an example of how you can do fairly well if the problem isn't too noisy:

Here's some data that is generated from your model (population values are A = 1.9947, B = 10, C = 2.828, D = 0.09, E = 5):

nls data

The start values I was able to estimate are
(As = 1.658, Bs = 10.001, Cs = 3.053, Ds = 0.0881, Es = 5.026)

The fit of that start model looks like this:

nlstart

The steps were:

Fit a Theil regression to get a rough estimate of D and E
Subtract the fit of the Theil regression off
Use LOESS to fit a smooth curve
Find the peak to get a rough estimate of A, and the x-value corresponding to the peak to get a rough estimate of B
Take the LOESS fits whose y-values are > 60% of the estimate of A as observations and fit a quadratic
Use the quadratic to update the estimate of B and to estimate C
From the original data, subtract off the estimate of the Gaussian
Fit another Theil regression to that adjusted data to update the estimate of D and E

In this case, the values are very suitable for starting a nonlinear fit.

I wrote this as R code but the same thing could be done in MATLAB.

I think better things than this are possible.

If the data are very noisy, this won't work at all well.

Edit2: This is the code I used in R, if anyone is interested:

gausslin.start <- function(x,y) {

  theilreg <- function(x,y){
    yy <- outer(y, y, "-")
    xx <- outer(x, x, "-")
    z  <- yy / xx
    slope     <- median(z[lower.tri(z)])
    intercept <- median(y - slope * x)
    cbind(intercept=intercept,slope=slope)
  }

  tr <- theilreg(x,y1)
  abline(tr,col=4)
  Ds = tr[2]
  Es = tr[1]
  yf  <- y1-Ds*x-Es
  yfl <- loess(yf~x,span=.5)

  # assumes there are enough points that the maximum there is 'close enough' to 
  #  the true maximum

  yflf   <- yfl$fitted    
  locmax <- yflf==max(yflf)
  Bs     <- x[locmax]
  As     <- yflf[locmax]

  qs     <- yflf>.6*As
  ys     <- yfl$fitted[qs]
  xs     <- x[qs]-Bs
  lf     <- lm(ys~xs+I(xs^2))
  bets   <- lf$coefficients
  Bso    <- Bs
  Bs     <-  Bso-bets[2]/bets[3]/2
  Cs     <- sqrt(-1/bets[3])
  ystart <- As*exp(-((x-Bs)/Cs)^2)+Ds*x+Es

  y1a <- y1-As*exp(-((x-Bs)/Cs)^2)
  tr  <- theilreg(x,y1a)
  Ds  <- tr[2]
  Es  <- tr[1]
  res <- data.frame(As=As, Bs=Bs, Cs=Cs, Ds=Ds, Es=Es)
  res
}

.

# population parameters: A = 1.9947 , B = 10, C = 2.828, D = 0.09, E = 5
# generate some data
set.seed(seed=3424921)
x  <- runif(50,1,30)
y  <- dnorm(x,10,2)*10+rnorm(50,0,.2)
y1 <- y+5+x*.09 # This is the data
xo <- order(x)

starts <- gausslin.start(x,y1)
ystart <- with(starts, As*exp(-((x-Bs)/Cs)^2)+Ds*x+Es)
plot(x,y1)
lines(x[xo],ystart[xo],col=2)

Best Answer

Related Solutions

Solved – Strange outcome when performing nonlinear least squares fit to a power law

Solved – How to choose initial values for nonlinear least squares fit

.

Related Question