Solved – Theil-Sen estimator for polynomial

curve fittingrobust

We're using the Theil-Sen estimator for fitting a line (orange) through mostly straight data. It works very well compared to the linear trendline (blue) in Excel:

Straight

We also need to fit mostly curved data with outliers. Is there something similar to the Theil-Sen estimator we could use? Excel's second degree polynomial trendline (blue) for all points, results in a bad fit due to the outliers on the right. When the data is manually truncated to 0 < x < 1.9, the polynomial trendline (orange) fits a lot better. But the outliers may be present in the middle of the data as well.

Curved

A second degree polynomial fits our data good enough, but other curves may be fitted as well. Our goal is to get the y position and slope of the fitted curve at each x.

Excel data available here.

Best Answer

James Phillips' suggestion on how to expand the Theil-Sen algorithm to a second degree polynomial worked surprisingly well. There were 762 (x,y)-points in the dataset I tested. Selecting three different points from the 762 can be made in 73 million ways, so instead I put the points into groups of 11 and calculated the median x och y values of each group. $\binom{762/11}{3} = \binom{69}{3} \approx 52000$ which is a more reasonable number of combinations to use.

For each combination of three points, I calculated the $a$ coefficient for $y = ax^2 + bx + c$ using:

$a = \frac{x_3(y_2 - y_1) + x_2(y_1 - y_3) + x_1(y_3 - y_2)}{(x_1 - x_2)(x_1 - x_3)(x_2 - x_3)}$

Using the median of all $a$'s, for each combination of two points I calculated the slope of the line $y - ax^2 = bx + c$ using:

$b = \frac{(y_2 - ax_2^2) - (y_1 - ax_1^2)}{x_2 - x_1}$

Finally, using the the medians of all $a$'s and $b$'s, for each point I calculated the intercept of $y - ax^2 - bx = c$ using:

$c = y_n - ax_n^2 - bx_n$

I used the Remedian median value approximation and the algorithm implemented in C# took ~15 ms to run for 69 points. If the initial median filter is reduced to 5 points, the execution time increases to ~110 ms which still is ok. Running the algorithm on all 762 points results in an execution time of 13 seconds.

Even with the fast 11 point filter, the results looks very good:

2nd degree Theil-Sen