Solved – Intercept calculation in Theil-Sen estimator

regression

I've just written some C code for Theil-Sen, after some Googling (I don't have any definitive documentation on it). My understanding of the intercept calculation is that I first calculate the median slope, and then construct a line through every data point with this slope, find the intercept of every line, and then take the median intercept.

The only way I can find to test the code is to compare the results against the Kendall-Theil Robust Line program, from the USGS. On a dataset of 237 points (healthcare data, with a Pearson correlation of ~0.55), we agree exactly on the median slope, but disagree on the intercept (by 1.4%). According to my figures, the KTRL intercept isn't the median intercept, but is instead 46% of the way through the range.

After some digging around in the KTRL code, it appears that they calculate the intercept by creating a single 'median line', rather than the median of all intercepts. Their intercept is medianY - medianX * median slope.

Any feedback on which is the "right" way to do this, if there is one, or how this is handled in R/etc?

Thanks.

Best Answer

The Theil-Sen estimator is essentially an estimator for the slope alone; the line has been constructed in a host of different ways - there are a large variety of ways to calculate the intercept.

You said:

My understanding of the intercept calculation is that I first calculate the median slope, and then construct a line through every data point with this slope, find the intercept of every line, and then take the median intercept.

A common one (probably the most common) is to compute median($y-bx$). This is what Sen looked at, for example; if I understand your intercept definition correctly this is the same as the intercept you mention.

There are a couple of approaches that compute the intercept of the line through each pair of points and attempts to get some kind of weighted-median but based off that (putting more weight on the points further apart in x-space).

Another is to try to get an estimator with higher efficiency at the normal (akin to that of the slope estimator in typical situations) and similar breakdown point to the slope estimate (there's probably little point in having better breakdown at the expense of efficiency), such as using the Hodges-Lehmann estimator (median of pairwise averages) on $y-bx$. This has a kind of symmetry in the way the slopes and intercepts are defined ... and generally gives something very close to the LS line when the normal assumptions nearly hold, whereas the Sen-intercept can be - relatively speaking - quite different.

Some people just compute the mean residual.

There are still other suggestions that have been looked at. There's really no 'one' intercept to go with the slope estimate.

Dietz lists several possibilities, possibly even including all the ones I mentioned, but that's by no means exhaustive.