Is Theil-Sen estimation in robust regression limited to a two-dimensional problem, or can it also be used with more than one independent variable?
Solved – Theil-Sen estimation, more than one independent variable
regression, robust
Related Solutions
The Theil-Sen estimator is essentially an estimator for the slope alone. The line itself has been constructed in a host of different ways, because there is a large variety of ways to calculate the intercept.
You said:
My understanding of the intercept calculation is that I first calculate the median slope, and then construct a line through every data point with this slope, find the intercept of every line, and then take the median intercept.
A common one (probably the most common) is to compute median($y-bx$). This is what Sen looked at, for example; if I understand your intercept definition correctly this is the same as the intercept you mention.
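As a minimal sketch of this combination (median of pairwise slopes, then median of $y-bx$), assuming plain Python lists and the hypothetical function names `theil_sen_slope` and `sen_intercept`:

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(x, y):
    """Median of the slopes through every pair of points with distinct x."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]  # pairs with equal x have no defined slope
    ]
    return median(slopes)


def sen_intercept(x, y, b):
    """Median of y - b*x: the intercept of the line with slope b through each point."""
    return median(yi - b * xi for xi, yi in zip(x, y))


# Four points on y = 2x + 1 plus one gross outlier.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 100]
b = theil_sen_slope(x, y)
c = sen_intercept(x, y, b)
```

Here the outlier affects only four of the ten pairwise slopes, so the median slope and the median intercept still recover the line through the other four points.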
There are a couple of approaches that compute the intercept of the line through each pair of points and then take some kind of weighted median of those intercepts (putting more weight on pairs of points that are further apart in x-space).
Another is to try to get an estimator with higher efficiency at the normal (akin to that of the slope estimator in typical situations) and similar breakdown point to the slope estimate (there's probably little point in having better breakdown at the expense of efficiency), such as using the Hodges-Lehmann estimator (median of pairwise averages) on $y-bx$. This has a kind of symmetry in the way the slopes and intercepts are defined ... and generally gives something very close to the LS line when the normal assumptions nearly hold, whereas the Sen-intercept can be - relatively speaking - quite different.
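A minimal sketch of the Hodges-Lehmann intercept, assuming the slope $b$ has already been estimated. The function names are hypothetical, and including each value paired with itself (the Walsh averages) is one common convention rather than something the text above specifies:

```python
from itertools import combinations_with_replacement
from statistics import median


def hodges_lehmann(values):
    """Median of pairwise averages, here including each value paired with itself."""
    vals = list(values)
    return median((u + v) / 2 for u, v in combinations_with_replacement(vals, 2))


def hl_intercept(x, y, b):
    """Hodges-Lehmann location estimate of the partial residuals y - b*x."""
    return hodges_lehmann(yi - b * xi for xi, yi in zip(x, y))


# Same kind of data as a line y = 2x + 1 with one outlier; b = 2 is assumed known.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 100]
c = hl_intercept(x, y, 2.0)
```

The number of pairwise averages grows quadratically with the number of points, the same order as the number of pairwise slopes.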
Some people just compute the mean residual.
There are still other suggestions that have been looked at. There's really no 'one' intercept to go with the slope estimate.
Dietz lists several possibilities, possibly even including all the ones I mentioned, but that's by no means exhaustive.
James Phillips' suggestion on how to extend the Theil-Sen algorithm to a second-degree polynomial worked surprisingly well. There were 762 (x,y)-points in the dataset I tested. Three different points can be selected from the 762 in about 73 million ways, so instead I put the points into groups of 11 and calculated the median x and y values of each group. This gives $\binom{\lfloor 762/11 \rfloor}{3} = \binom{69}{3} \approx 52000$ combinations, which is a more reasonable number to use.
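The counts and the group-median reduction can be sketched as follows. This is an illustration only: the helper name `group_medians` is hypothetical, and it assumes the points are already ordered by x and that a trailing incomplete group is dropped, neither of which is stated above.

```python
from math import comb
from statistics import median


def group_medians(points, group_size):
    """Replace each consecutive group of points by (median x, median y).

    Assumes `points` is already ordered by x; a trailing incomplete
    group is dropped (an assumption, chosen so that 762 points give 69 groups).
    """
    reduced = []
    for start in range(0, len(points) - group_size + 1, group_size):
        group = points[start:start + group_size]
        reduced.append((median(p[0] for p in group), median(p[1] for p in group)))
    return reduced


full_count = comb(762, 3)            # 73451720 triples from all points
reduced_count = comb(762 // 11, 3)   # 52394 triples from 69 group medians
```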
For each combination of three points, I calculated the $a$ coefficient for $y = ax^2 + bx + c$ using:
$a = \frac{x_3(y_2 - y_1) + x_2(y_1 - y_3) + x_1(y_3 - y_2)}{(x_1 - x_2)(x_1 - x_3)(x_2 - x_3)}$
Using the median of all $a$'s, for each combination of two points I calculated the slope of the line $y - ax^2 = bx + c$ using:
$b = \frac{(y_2 - ax_2^2) - (y_1 - ax_1^2)}{x_2 - x_1}$
Finally, using the medians of all $a$'s and $b$'s, for each point I calculated the intercept of $y - ax^2 - bx = c$ using:
$c = y_n - ax_n^2 - bx_n$
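The three steps can be sketched in Python as below. This is a brute-force illustration of the formulas above using exact medians, not the C# implementation; the function name `quadratic_theil_sen` is hypothetical, and the final intercept is taken as the median of the per-point $c$ values by analogy with the first two steps.

```python
from itertools import combinations
from statistics import median


def quadratic_theil_sen(x, y):
    """Fit y = a*x^2 + b*x + c by successive medians over triples, pairs, points."""
    pts = list(zip(x, y))

    # Step 1: a from every triple of points with distinct x values.
    a_vals = []
    for (x1, y1), (x2, y2), (x3, y3) in combinations(pts, 3):
        denom = (x1 - x2) * (x1 - x3) * (x2 - x3)
        if denom != 0:
            num = x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)
            a_vals.append(num / denom)
    a = median(a_vals)

    # Step 2: slope of y - a*x^2 = b*x + c from every pair of points.
    b = median(
        ((y2 - a * x2**2) - (y1 - a * x1**2)) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(pts, 2)
        if x2 != x1
    )

    # Step 3: intercept from every point.
    c = median(yi - a * xi**2 - b * xi for xi, yi in pts)
    return a, b, c


# Points on y = 2x^2 - 3x + 1 with one outlier.
xs = [-3, -2, -1, 0, 1, 2, 3, 4]
ys = [2 * v**2 - 3 * v + 1 for v in xs]
ys[5] += 50
a, b, c = quadratic_theil_sen(xs, ys)
```

The triple loop is cubic in the number of points, which is why reducing the data before this step matters so much.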
I used the Remedian approximation of the median, and the algorithm, implemented in C#, took ~15 ms to run for 69 points. If the initial median filter is reduced to 5 points, the execution time increases to ~110 ms, which is still OK. Running the algorithm on all 762 points results in an execution time of 13 seconds.
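For readers unfamiliar with it, the remedian approximates a median with small fixed-size buffers: whenever a buffer fills, its median is pushed to the next buffer. The sketch below is one simple Python variant, not the C# code used here; the function name, the default buffer size of 11, and the weighted-median handling of partly filled buffers are assumptions.

```python
from statistics import median


def remedian(values, base=11):
    """Approximate the median of a stream using buffers of size `base`.

    `base` should be odd so each buffer median is an actual data value.
    """
    buffers = [[]]
    for v in values:
        buffers[0].append(v)
        level = 0
        # Cascade: a full buffer is replaced by its median one level up.
        while len(buffers[level]) == base:
            m = median(buffers[level])
            buffers[level].clear()
            if level + 1 == len(buffers):
                buffers.append([])
            buffers[level + 1].append(m)
            level += 1

    # Leftovers: an element at level k stands for base**k original values,
    # so combine them with a weighted median.
    leftovers = sorted(
        (v, base**lvl) for lvl, buf in enumerate(buffers) for v in buf
    )
    if not leftovers:
        raise ValueError("remedian of empty data")
    total = sum(w for _, w in leftovers)
    running = 0
    for v, w in leftovers:
        running += w
        if 2 * running >= total:
            return v
```

The result is generally close to, but not guaranteed equal to, the exact median; the benefit is that memory use stays small and no full sort of the input is needed.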
Even with the fast 11-point filter, the results look very good.
Best Answer
There have been a number of proposals for extending Theil-Sen estimation to multiple regression contexts.
I'll point to a couple:
1) Zhou, W. and R. Serfling (2007), "Multivariate Spatial U-Quantiles: a Bahadur-Kiefer Representation, a Theil-Sen Estimator for Multiple Regression, and a Robust Dispersion Estimator," Journal of Statistical Planning and Inference, May.
2) Wang, X., X. Dang, H. Peng, and H. Zhang (2009), "The Theil-Sen Estimators in a Multiple Linear Regression Model" (available in different versions).
The first is based on extending univariate U-quantiles to multivariate U-quantiles, and the second is based on a multivariate median.