Which is better: least-squares regression or an unknown method I found my coworker using?

extrapolation, least squares, linear regression, means, regression

It's been a while since I did some good old regression, but I've been given a dataset with simple x and y data and need to extrapolate to predict the value of x at which y reaches a given value. The x data is just a set of increasing integers 1, 2, 3, ... up to some value, and the y data is cumulative, so monotonically increasing.

The relationship between x and y can be assumed to be linear, so I thought it would be a good idea to perform a linear least-squares regression and use the resulting model to find my predicted x value (that is, x' = (y' - c)/m, where x' is the predicted x for the given value of y', m is the gradient and c is the y-intercept).
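A minimal sketch of what I mean, using made-up numbers in place of the real dataset (all names here are just placeholders):

```python
import numpy as np

# Made-up stand-in for the real dataset: x = 1, 2, 3, ...,
# y cumulative and roughly linear in x.
x = np.arange(1, 11)
y = np.cumsum(2.0 + np.random.normal(0, 0.1, size=10))

# Ordinary least-squares fit y = m*x + c.
m, c = np.polyfit(x, y, 1)

# Invert the fitted line to predict x' for a given y'.
y_prime = 30.0
x_prime = (y_prime - c) / m
print(x_prime)
```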

However, when I went to check this over with a coworker, they'd taken a different approach.

As the data was cumulative, they'd instead taken the non-cumulative y data (which would just be y[i] - y[i-1]) and used its mean as the gradient, predicting the expected value of x with the greatest (x, y) pair in the dataset as a starting point (i.e. x' = (y' - y[last])/m + x[last], where x' is the predicted x for the given value of y', m is the gradient, and x[last] and y[last] are the last and greatest values in the dataset).
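For comparison, a sketch of my coworker's calculation on similarly made-up data, as I understand it:

```python
import numpy as np

# Same kind of made-up dataset as above: x = 1, 2, 3, ..., y cumulative.
x = np.arange(1, 11)
y = np.cumsum(2.0 + np.random.normal(0, 0.1, size=10))

# Mean of the non-cumulative increments y[i] - y[i-1] as the gradient.
m = np.diff(y).mean()

# Extrapolate from the last (greatest) (x, y) pair in the dataset.
y_prime = 30.0
x_prime = (y_prime - y[-1]) / m + x[-1]
print(x_prime)
```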

I understand that if our data were perfectly linear then we'd get the same result for x'; in reality, though, the results will always differ. Is either method advantageous? I'd have thought least-squares regression would be better, but I have no evidence to back that up, and I also can't imagine my coworker's method working very well if the data weren't monotonically increasing. Any thoughts on the matter would be greatly appreciated.

Best Answer

Linear regression is for cases where you have Gaussian noise in the measured quantity, that is, for problems where $y_{measured} = y_{true} + \eta = a x + b + \eta$, where $\eta$ is a noise term. In the following I will assume that $x_n = n x_1$ (no initial offset in the $x$ values).

Your colleague and you are dealing with two different cases of the source of the noise:

a) If each $dy[j]$ is noisy in the same way, then the cumulative noise increases at each step. In that case, regressing $dy[i] = a + \eta_i$ is the correct approach, and $\langle dy \rangle$ is the regression estimate for $a$. $b$ should be zero in this case.
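To spell out why the mean is the least-squares estimate in case (a): fitting the constant model amounts to minimizing $\sum_i (dy[i]-a)^2$ over $a$, and setting the derivative to zero gives

$$\frac{d}{da}\sum_i (dy[i]-a)^2 = -2\sum_i (dy[i]-a) = 0 \quad\Rightarrow\quad \hat a = \frac{1}{N}\sum_i dy[i] = \langle dy \rangle .$$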

b) If the $dy[i]$ are exact but your measurement adds a noise term to the cumulative sum, your approach (“vanilla” regression) is the right way to go. This usually means that the apparatus adds the noise; for example, a meter has a noisy readout. If you do not expect the apparatus to add an additional bias, you do not need the $b$ term in the regression, only $y_{measured} = y_{true} + \eta = ax + \eta$.
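For case (b), dropping the intercept means the least-squares fit of $y = ax$ has the closed form $\hat a = \sum_i x_i y_i / \sum_i x_i^2$. A quick sketch on made-up data (placeholder names again):

```python
import numpy as np

# Made-up data for case (b): exact increments, noise added to the cumulative sum.
x = np.arange(1, 11).astype(float)
y = 2.0 * x + np.random.normal(0, 0.1, size=10)

# Least squares with no intercept: minimize sum (y - a*x)^2,
# which has the closed form a_hat = sum(x*y) / sum(x*x).
a_hat = np.sum(x * y) / np.sum(x * x)
print(a_hat)
```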

As for the question of $x$ given $y$, your model is either (a) $y[i]_{measured} = ax[i] + \sum^{i}_{j=1} \eta[j]$ or (b) $y[i]_{measured} = ax[i] + \eta[i]$.

In case (a), $\log P(y|x) \sim -\frac{(y-ax)^2}{2(\sigma x)^2}$, and in case (b), $\log P(y|x) \sim -\frac{(y-ax)^2}{2\sigma^2}$. This is because the noise gets larger for larger $x$ in the cumulative-noise case. To decide the “best” $x$ given $y$, you have to find the $x$ that maximizes these expressions in each case. It is not difficult to see that in case (b) this occurs when $ax$ is as close as possible to $y$, so the value of $y/a$ rounded to the nearest increment of $x_1$ is the answer. Case (a) is slightly trickier: consider the case where $y_{measured} = 3.49$ and look at $-(3.49-x)^2/(2x^2)$ (taking $a = 1$ and $\sigma = 1$). For $x = 3$ I get $-0.013338$, while for $x = 4$ I get $-0.00812$, so as you can see, there is a slight bias towards larger values of $x$.
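A quick numerical check of that worked example (taking $a = 1$, $x_1 = 1$ and $\sigma = 1$, as the expression above does):

```python
# Case (a) log-likelihood up to a constant, with a = 1 and sigma = 1:
# log P(y|x) ~ -(y - x)^2 / (2 * x^2)
def loglik_cumulative(y, x):
    return -(y - x) ** 2 / (2 * x ** 2)

y_measured = 3.49
print(loglik_cumulative(y_measured, 3))  # about -0.0133
print(loglik_cumulative(y_measured, 4))  # about -0.0081, so x = 4 is preferred
```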
