Solved – Perpendicular offsets in a weighted least squares regression

least squaresMATLABrregression

Perpendicular offset least square fitting has a lot of advantages compared to the native least square fitting scheme. The following figure illustrates the difference between there, and for a more detailed comparison of these two methods, we refer to here.

enter image description here

Perpendicular offset least square fitting, however, is not robust to outliers( points that are not supposed to be used for model estimation). Therefore, I am now considering to use a weighted perpendicular offset least square regression method. The method has two steps:

  1. Calculate the weighting factor for each points that are going to be used for line estimation;
  2. Perform perpendicular offset in a weighted least square regression scheme.

For the time being, my biggest problem comes from step 2. Suppose the weighting factors are given, how can I get the formula to estimate the parameters of the line? Many thanks!

EDIT:

Based on the kind suggestion of @MvG I have implemented the algorithm in MATLAB:

function  line =  estimate_line_ver_weighted(pt_x, pt_y,w);
% pt_x  x coordinate
% pt_y  y coordinate
% w     weighting factor


pt_x = pt_x(:);
pt_y = pt_y(:);
w    = w(:);


% step 1: calculate n
n = sum(w(:));

% step 2: calculate weighted coordinates 
y_square = pt_y(:).*pt_y(:);
x_square = pt_x(:).*pt_x(:);
x_square_weighted = x_square.*w;  
y_square_weighted = y_square.*w;  
x_weighted        = pt_x.*w;
y_weighted        = pt_y.*w;

% step 3: calculate the formula
B_upleft = sum(y_square_weighted)-sum(y_weighted).^2/n;
B_upright = sum(x_square_weighted)-sum(x_weighted).^2/n;
B_down = sum(x_weighted(:))*sum(y_weighted(:))/n-sum(x_weighted.*pt_y);
B = 0.5*(B_upleft-B_upright)/B_down;

% step 4: calculate b
if B<0
    b       = -B+sqrt(B.^2+1);
else
    b       = -B-sqrt(B.^2+1);
end

% Step 5: calculate a
a = (sum(y_weighted)-b*sum(x_weighted))/n;

% Step 6: the model is y = a + bx, and now we transform the model to 
% a*x + b*y + c = 0;
c_ = a;
a_ = b;
b_ = -1;

line = [a_ b_ c_];

The result is as good as we can expect, which is illustrated in the following script:

%% Procedure 1: given the data
pt_x = [   692   692   693   692   693   693   750];
pt_y = [ 919         971        1022        1074        1126        1230        1289];

% Procedure 2: draw the point 
 close all; figure; plot(pt_x,pt_y,'b*');

% Procedure 3: estimate the line based on the weighted vertical offset
% least square method.
 weighting = ones(length(pt_x),1);
 weighting(end) = 0.01;  % we give the last point a low weighting because obvously it is an outlier
 myline =    estimate_line_ver_weighted(pt_x,pt_y,weighting); 
 a = myline(1); b = myline(2); c= myline(3);

 % Procedure 4: draw the line
 x_range = [min(pt_x):0.1:max(pt_x)];
 y_range = [min(pt_y):0.1:max(pt_y)];
 if length(x_range)>length(y_range)
        x_range_corrspond = -(a*x_range+c)/b;
        hold on; plot(x_range,x_range_corrspond,'r');
 else
        y_range_correspond = -(b*y_range+c)/a;
        hold on; plot(y_range_correspond,y_range,'r');
 end

The following figure corresponds to the above script:
enter image description here.

Best Answer

Completely revised answer, see history.

Take the formula from your link. It contains a lot of sums iterating over your input points. Make sure to multiply the summands in all of these sums with your weights $w$:

\begin{align*} \sum_{i=1}^n x_i &\to \sum_{i=1}^n w_ix_i \\ \sum_{i=1}^n y_i &\to \sum_{i=1}^n w_iy_i \\ \sum_{i=1}^n x_i^2 &\to \sum_{i=1}^n w_ix_i^2 \\ \sum_{i=1}^n x_iy_i &\to \sum_{i=1}^n w_ix_iy_i \\ \sum_{i=1}^n y_i^2 &\to \sum_{i=1}^n w_iy_i^2 \\ n = \sum_{i=1}^n 1 &\to \sum_{i=1}^n w_i \end{align*}

Notice that I previously suggested weighting the coordinates, but that causes one $w$ too many for the second-order terms. To simulate the effect of $w$ denoting the multiplicity of points (i.e. $w_i=3$ should have the same effect as point $i$ repeated $3$ times), you have to have exactly one $w$ for every sum iterating over your set of points. Your code still has one $w$ too many in the sum(x_weighted.*y_weighted) term of B_down.

With this solution, and using exact arithmetic on algebraic numbers to avoid numeric issues, one of the two solutions of the quadratic equation gives a pretty good result on the example data you provided. Seeing as $B$ is only around $22$ with the correct computation, numeric issues shouldn't be to serious a problem, contrary to my previous experiences with the incorrect weighting. I still don't know which solution will be the correct one in general, whether you can always choose the one with the positive square root, or whether you have to examine the sign of the second derivative.

Related Question