[Math] Simple algorithm effectiveness (Least-Squares Regression Line in Terms of Sample Variances)

algorithms, statistics

I hope I'm in the right place: I asked this question on stackoverflow.com, where it was closed and I was advised to ask it here instead. I'll also try to be more comprehensive, taking into account the couple of replies I got there.

The question is simple for those who know how algorithms work, but as a physics student I'm not sure about the answer. I had a Statistics exam where the professor suggested that the formula I used (outlined in red) for determining the least-squares regression line was computationally inferior to the one derived below (outlined in green). My simple reasoning was that the first one uses n squares while the derived one uses (n+1) squares, but supposedly that's not the whole story. Can someone point out why the derived formula could be computationally more efficient?

[image: the two variance formulas, outlined in red and green]

Thanks

Best Answer

You're comparing the two variance formulae

$$s_x^2=\frac1{n}\sum_{i=1}^n \left(x_i-\bar{x}\right)^2$$

and

$$s_x^2=\frac1{n}\left(\sum_{i=1}^n x_i^2\right)-\bar{x}^2$$

With respect to the computational effort needed, the second formula looks better: it requires only a single pass through your data, as opposed to the two passes needed for the first one (one to compute the mean, and a second to add up the squared deviations).
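For concreteness, here is a minimal sketch in Python of the two approaches (the function names are mine, not from the question):

```python
def variance_two_pass(xs):
    """First formula: one pass to compute the mean, a second pass for the squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def variance_one_pass(xs):
    """Second formula: accumulate the sum and the sum of squares in a single pass."""
    n = len(xs)
    total = total_sq = 0.0
    for x in xs:
        total += x
        total_sq += x * x
    mean = total / n
    return total_sq / n - mean * mean
```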

The truth of the matter, however, is that the first formula makes much more numerical sense than the second one. In practice, the two terms being subtracted in the second form can both be very large; if the variance is small, you lose many significant figures when subtracting two nearly equal quantities (subtractive cancellation). In the worst case, you may even end up with a negative variance from the second formula!
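To see the cancellation in action, here is a small example using the two functions sketched above (the data values are made up for illustration): the samples have a huge mean and a tiny spread, so the two terms being subtracted are each about $10^{18}$ while their true difference is only about $6.7\times 10^{-7}$.

```python
# Three samples with mean 1e9 and true population variance ~6.67e-7.
data = [1e9 - 0.001, 1e9, 1e9 + 0.001]

print(variance_two_pass(data))  # ~6.67e-07, close to the true value
print(variance_one_pass(data))  # ruined by cancellation: 0.0, a meaningless value, or even negative
```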
