[Math] Linear Fit when Data has Uncertainty

linear algebra, linear regression, statistics, weighted least squares

I am attempting to find the slope and y-intercept (along with their uncertainty) from a set of data. In this case, I am graphing Gamma Energy (MeV) vs. Peak Centroid (Channel). Here is my data:

Gamma Energy (MeV): 1.17, 1.33, 0.032, 0.662, 0.511, 1.275, 0.088

Peak Centroid (Channel): 622.65, 712, 21, 360.38, 280.64, 676.85, 18.68

Peak Centroid Uncertainty: 0.0342, 0.0347, 2, 0.0155, 0.0231, 0.0288, 0.1346

As can be seen, one Centroid Uncertainty value is significantly higher than the rest, so I cannot use a normal non-weighted least squares fit. I researched this a bit and found that I can use the formula $\left ( X^TWX \right )\hat{\beta }=X^TWy$ to solve for $\hat{\beta}$. Now, I know $y_{i}$ is simply each of my Gamma Energies. I also know $W_{ii}=\frac{1}{\sigma_{i}^2}$, where $\sigma_{i}$ corresponds to each of my Centroid Uncertainties; all other elements of $W$ are $0$. My issue is that I do not know how to construct $X$. I have looked at various sources, and none of them really made sense to me. What exactly does each element correspond to, and how do I implement that in my case? When I find $\hat{\beta}$, won't this simply correspond to the $m$ and $b$ values in $y=mx+b$? How can I find the uncertainty in $m$ and $b$? Also, in my case, the uncertainties are in the $x$-values, whereas (I think) the formula I am using assumes them to be in the $y$-values. Will this affect my results? As an aside, if I had errors in both the $x$- and $y$-values, how would I solve that?

Best Answer

1) The least squares method, in its simplest standard form, assumes that the $x$ are "certain" and that the $y$ are subject to error, following the same centered normal distribution, i.e. with the same $\sigma$ (homoscedasticity).

You instead want to find the reverse relation: once you have obtained $y=mx+b$, you can always invert it into $x=(1/m)y-b/m$.

To avoid confusion, let's follow the standard convention, understanding that $x$ refers to the error-free variable, while $y$ is the one subject to error (in your case $x$ is the gamma energy and $y$ the peak centroid, since the quoted uncertainties are on the centroids).
So we are looking for the regression line $y=mx+b$.

2) We assume to have $n$ independent random variables $Y_k$, whose PDF (probability density function) is $$ \bbox[lightyellow] { p\left( {\eta _k } \right) = {1 \over {\sqrt {2\pi \sigma _k ^2 } }}\exp \left( { - {1 \over 2}{{\left( {\eta _k - \hat m\,x_k - \hat b} \right)^2 } \over {\sigma _k ^2 }}} \right) } \tag{1}$$ where $\hat y_k = \hat m\,x_k + \hat b$ is the supposed "exact" value of $y$, defined by the "exact" values of the parameters $m$ and $b$, indicated with the hat.

3) The probability that we get the $n$-tuple of values $(y_1,y_2,\cdots,y_n)$, considering the $x$ and $\sigma$ $n$-tuples as fixed, is $$ \bbox[lightyellow] { p\left( {y_1 } \right) \cdots p\left( {y_n } \right)\;dy_1 \cdots dy_n } \tag{2}$$ which is maximum when $$ \bbox[lightyellow] { E^2 (m,b) = \sum\limits_k {{{\left( {y_k - \hat m\,x_k - \hat b} \right)^2 } \over {\sigma _k ^2 }}} = \sum\limits_k {\left( {{{y_k - \hat y_k } \over {\sigma _k }}} \right)^2 } } \tag{2'}$$ is minimum, which means $$ \bbox[lightyellow] { \left\{ \matrix{ 0 = {\partial \over {\partial b}}E^2 (m,b)\quad \Rightarrow \quad \sum\limits_k {{{\left( {y_k - \hat m\,x_k - \hat b} \right)} \over {\sigma _k ^2 }}} = 0 \hfill \cr 0 = {\partial \over {\partial m}}E^2 (m,b)\quad \Rightarrow \quad \sum\limits_k {{{\left( {y_k - \hat m\,x_k - \hat b} \right)x_k } \over {\sigma _k ^2 }}} = 0 \hfill \cr} \right. } $$ The first equation tells us that the estimated value of $b$ (indicated with a tilde) is given by the weighted averages of the $y_k$ and $x_k$ values, with weights $1/\sigma_k^2$: $$ \bbox[lightyellow] { \eqalign{ & \tilde b = \sum\limits_k {\left( {{{\left( {1/\sigma _k ^2 } \right)} \over {\sum\limits_k {\left( {1/\sigma _k ^2 } \right)} }}} \right)y_k } - \tilde m\sum\limits_k {\left( {{{\left( {1/\sigma _k ^2 } \right)} \over {\sum\limits_k {\left( {1/\sigma _k ^2 } \right)} }}} \right)x_k } = \cr & = \bar y - \tilde m\,\bar x \cr} } \tag{3}$$ So we can rewrite the second as $$ \bbox[lightyellow] { 0 = \sum\limits_k {{{\left( {y_k - \hat m\,x_k - \hat b} \right)x_k } \over {\sigma _k ^2 }}} = \sum\limits_k {{{\left( {\left( {y_k - \bar y} \right) - \hat m\,\left( {x_k - \bar x} \right)} \right)\left( {x_k - \bar x} \right)} \over {\sigma _k ^2 }}} } $$ and get $$ \bbox[lightyellow] { \tilde m = \sum\limits_k {{{\left( {y_k - \bar y} \right)\left( {x_k - \bar x} \right)} \over {\sigma _k ^2 }}} \;\;\mathop /\limits_{} \;\; \sum\limits_k {{{\left( {x_k - \bar x} \right)^2 } \over {\sigma _k ^2 }}} } \tag{4}$$
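As an illustration, here is a minimal Python sketch of formulas (3) and (4); the function and variable names are mine, not part of the derivation:

```python
import numpy as np

def wls_line(x, y, sigma):
    """Weighted least-squares line y = m*x + b with weights 1/sigma**2,
    following eqs. (3) and (4)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(sigma, dtype=float) ** 2   # weights 1/sigma_k^2
    xbar = np.sum(w * x) / np.sum(w)                # weighted mean of x
    ybar = np.sum(w * y) / np.sum(w)                # weighted mean of y
    m = np.sum(w * (y - ybar) * (x - xbar)) / np.sum(w * (x - xbar) ** 2)  # eq. (4)
    b = ybar - m * xbar                             # eq. (3)
    return m, b
```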

4) Passing to the more compact vector notation, writing $$ \bbox[lightyellow] { {\bf Y} = \left( {\matrix{ {y_1 } \cr {y_2 } \cr \vdots \cr {y_n } \cr } } \right)\quad {\bf X} = \left( {\matrix{ 1 & {x_1 } \cr 1 & {x_2 } \cr \vdots & \vdots \cr 1 & {x_n } \cr } } \right)\quad {\bf \beta } = \left( {\matrix{ b \cr m \cr } } \right)\quad {\bf W} = \left( {\matrix{ {1/\sigma _1 ^2 } & 0 & \cdots & 0 \cr 0 & {1/\sigma _2 ^2 } & \cdots & 0 \cr \vdots & \vdots & \ddots & \vdots \cr 0 & 0 & \cdots & {1/\sigma _n ^2 } \cr } } \right) } $$ we are going to minimize $$ \bbox[lightyellow] { E^2 ({\bf \beta }) = \left( {{\bf W}^{\,{\bf 1/2}} \,\left( {{\bf Y} - {\bf X\beta }} \right)} \right)^{\,{\bf T}} \left( {{\bf W}^{\,{\bf 1/2}} \,\left( {{\bf Y} - {\bf X\beta }} \right)} \right) } $$ which returns $$ \bbox[lightyellow] { {\bf \tilde \beta } = \left( {{\bf X}^{\,{\bf T}} {\bf WX}} \right)^{ - \,{\bf 1}} {\bf X}^{\,{\bf T}} \,{\bf W}\;{\bf Y} } \tag{4'}$$
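In code, constructing $\bf X$ is just a column of ones (for $b$) stacked next to the column of $x$ values (for $m$). A possible numpy sketch of eq. (4'), solving the normal equations rather than inverting explicitly (again, the helper name is mine):

```python
import numpy as np

def wls_matrix(x, y, sigma):
    """Solve (X^T W X) beta = X^T W Y for beta = (b, m)."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones_like(x), x])        # first column -> b, second -> m
    W = np.diag(1.0 / np.asarray(sigma, dtype=float) ** 2)
    A = X.T @ W @ X
    rhs = X.T @ W @ np.asarray(y, dtype=float)
    beta = np.linalg.solve(A, rhs)                   # beta[0] = b, beta[1] = m
    return beta
```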

Refer also to this Wikipedia article.

Applied to your data this gives $$ \bbox[lightyellow] { b=13.27,\;m=522.76 }$$
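For example, feeding the data from the question into the sketch above (energy as the error-free $x$, centroid as $y$, centroid uncertainty as $\sigma$) should reproduce roughly those values:

```python
x = [1.17, 1.33, 0.032, 0.662, 0.511, 1.275, 0.088]          # Gamma Energy (MeV)
y = [622.65, 712, 21, 360.38, 280.64, 676.85, 18.68]         # Peak Centroid (Channel)
sigma = [0.0342, 0.0347, 2, 0.0155, 0.0231, 0.0288, 0.1346]  # Centroid Uncertainty

b, m = wls_matrix(x, y, sigma)   # expect b ~ 13.27, m ~ 522.76
print(b, m)
```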

5) Concerning the confidence about the estimated values of $m$ and $b$, you can find various considerations in the already cited article as well as in this other related one, together with the necessary cautions regarding statistical soundness and applicability.

However, without pretending to be exhaustive or totally rigorous, but just to outline the basis of a possible approach, let's go back to (1) and (2).
We can consider eq. (2) as proportional to the conditional probability of getting the vector $\bf Y$, given $\bf \beta$: $$ P\left( {{\bf Y}\,\,\left| {\;{\bf \beta }} \right.} \right) \propto p\left( {y_1 } \right) \cdots p\left( {y_n } \right)\;dy_1 \cdots dy_n $$ Assuming $P({\bf \beta })$ to be fairly constant, at least within the "admissible" range of values, we can write $$ \bbox[lightyellow] { \eqalign{ & P\left( {{\bf \beta }\,\,\left| {\,{\bf Y}\;} \right.} \right) \propto p\left( {y_1 } \right) \cdots p\left( {y_n } \right)\;db\,dm \propto \cr & \propto \exp \left( { - {1 \over 2}\sum\limits_k {\left( {{{y_k - \hat m\,x_k - \hat b} \over {\sigma _k }}} \right)^2 } } \right) = \exp \left( { - {1 \over 2}\sum\limits_k {\left( {{{\,x_k } \over {\sigma _k }}\hat m + {1 \over {\sigma _k }}\hat b - {{y_k } \over {\sigma _k }}} \right)^2 } } \right) \cr} } \tag{5}$$ with $x_k, y_k, \sigma_k$ known and fixed.
Note that presuming the $\sigma_k$ to be known, rather than estimated from the spread of the $y_k$, is a crucial point in fixing the statistics of the parameters (Normal instead of Student's $t$).

The sum of the quadratic forms in $b,m$ yields a quadratic form that can be written as $$ \bbox[lightyellow] { \eqalign{ & Q(b,m) = \sum\limits_k {\left( {{1 \over {\sigma _k }}\hat b + {{\,x_k } \over {\sigma _k }}\hat m - {{y_k } \over {\sigma _k }}} \right)^2 } = \cr & = \left( {T^2 \left( {\hat b - \tilde b} \right)^2 + 2R^2 \left( {\hat b - \tilde b} \right)\left( {\hat m - \tilde m} \right) + U^2 \left( {\hat m - \tilde m} \right)^2 + V^2 } \right) \cr} } \tag{6}$$ where it is clear that $Q(b,m)$ attains its minimum value for $\hat b = \tilde b,\;\hat m = \tilde m$, which is the condition defining the estimated values.

Taking the first and second derivatives, it is possible to express $T,R,U,V$ in terms of the known values: $$ \bbox[lightyellow] { \left\{ \matrix{ T^{\,2} = \sum\limits_{1\, \le \,k\, \le \,n} {\left( {1/\sigma _{\,k} } \right)^{\,2} } \hfill \cr U^{\,2} = \sum\limits_{1\, \le \,k\, \le \,n} {\left( {x_{\,k} /\sigma _{\,k} } \right)^{\,2} } \hfill \cr R^{\,2} = \sum\limits_{1\, \le \,k\, \le \,n} {\left( {x_{\,k} /\sigma _{\,k} ^{\,2} } \right)} \hfill \cr V^{\,2} = \sum\limits_k {\left( {{1 \over {\sigma _k }}\tilde b + {{\,x_k } \over {\sigma _k }}\tilde m - {{y_k } \over {\sigma _k }}} \right)^2 } \hfill \cr} \right. } $$
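These are plain weighted sums; a Python sketch (the helper name is mine, reusing `x`, `y`, `sigma` and the fitted `b`, `m` from the snippet above):

```python
import numpy as np

def TRUV(x, y, sigma, b, m):
    """Coefficients of the quadratic form Q(b, m) in eq. (6)."""
    x, y, s = (np.asarray(a, dtype=float) for a in (x, y, sigma))
    T2 = np.sum(1.0 / s ** 2)
    U2 = np.sum((x / s) ** 2)
    R2 = np.sum(x / s ** 2)
    V2 = np.sum(((b + m * x - y) / s) ** 2)   # residual term at the minimum
    return T2, U2, R2, V2
```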

We can translate the above into the equation of a Bivariate Normal Distribution, absorbing the term in $V$ into the normalizing factor. $$ \bbox[lightyellow] { {1 \over {2\,\pi \,\sigma _{\,1} \,\sigma _{\,2} \sqrt {1 - \rho ^{\,2} } }}\exp \left( { - {1 \over {2\left( {1 - \rho ^{\,2} } \right)}}\left( {{{\left( {x_{\,1} - \mu _{\,1} } \right)^{\,2} } \over {\sigma _{\,1} ^{\,2} }} - 2\rho {{\left( {x_{\,1} - \mu _{\,1} } \right)\left( {x_{\,2} - \mu _{\,2} } \right)} \over {\sigma _{\,1} \,\sigma _{\,2} }} + {{\left( {x_{\,2} - \mu _{\,2} } \right)^{\,2} } \over {\sigma _{\,2} ^{\,2} }}} \right)} \right) } $$ where $\rho$ is the correlation $$ \bbox[lightyellow] { \rho = {\rm corr}\left( {x_{\,1} ,x_{\,2} } \right) = {{V_{12} } \over {\sigma _{\,1} \,\sigma _{\,2} }} } $$

Finally we arrive at $$ \bbox[lightyellow] { \left\{ \matrix{ \rho = - {{R^{\,2} } \over {T\,U}} \approx - \,0.93396 \hfill \cr \sigma _{\,b} = {1 \over {T\sqrt {\left( {1 - \rho ^{\,2} } \right)} }} \approx {\rm 0}{\rm .029524} \hfill \cr \sigma _{\,m} = {1 \over {U\sqrt {\left( {1 - \rho ^{\,2} } \right)} }} \approx {\rm 0}{\rm .033654} \hfill \cr} \right. } \tag{7}$$ a result I could cross-check with my old CAS program.
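Numerically, continuing the snippets above, the same quantities can be checked as:

```python
T2, U2, R2, _ = TRUV(x, y, sigma, b, m)

rho = -R2 / np.sqrt(T2 * U2)                     # expect ~ -0.934
sigma_b = 1.0 / np.sqrt(T2 * (1.0 - rho ** 2))   # expect ~ 0.0295
sigma_m = 1.0 / np.sqrt(U2 * (1.0 - rho ** 2))   # expect ~ 0.0337
```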

From here you can easily find the ellipse which delimits the parameters at the required confidence level.
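For instance, a rough sketch of how one might compute such an ellipse from $\sigma_b$, $\sigma_m$, $\rho$ above (assuming scipy is available; the ellipse is centered at $(\tilde b, \tilde m)$, and the 0.95 level is my arbitrary choice):

```python
import numpy as np
from scipy.stats import chi2

# Covariance matrix of (b, m), assembled from sigma_b, sigma_m, rho above
cov = np.array([[sigma_b ** 2,             rho * sigma_b * sigma_m],
                [rho * sigma_b * sigma_m,  sigma_m ** 2]])

level = 0.95                              # chosen confidence level
scale = chi2.ppf(level, df=2)             # chi-square quantile for 2 parameters
eigvals, eigvecs = np.linalg.eigh(cov)    # principal directions of the ellipse
semi_axes = np.sqrt(scale * eigvals)      # semi-axis lengths
tilt = np.degrees(np.arctan2(eigvecs[1, -1], eigvecs[0, -1]))  # orientation of major axis
```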

Now you can reverse the fitted relation $y = mx + b$ back into $x = (1/m)y - b/m$, as indicated at the beginning.
