Solved – How to estimate confidence intervals of least-squares fit parameters by means of a numerical Jacobian

confidence interval, correlation, curve fitting, estimation, regression

I am working on a complicated data fitting algorithm in Matlab. I have a problem with properly estimating the confidence intervals of my fit. I will describe my procedure in some detail, give some of my thoughts on the problem and subsequently formulate my question more precisely.

Disclaimer: I am a physicist, not a statistician. Please respond in technical, but not discipline-specific language if possible.

What do I do?

I have a complicated data set and an even more complicated model for it which I don't want to describe here since that is not relevant. I do a least squares fit of the data set with respect to $P$ – a vector of fit parameters (I am not using maximum likelihood estimation yet). This part works very well and I've tested it extensively on known (both real and artificial) data-sets.

I subsequently aim to follow this procedure in order to estimate the prediction bounds on my fit parameters. From now on I will use the same variable nomenclature so I recommend reading the procedure now.

Problem

My fit function $f(x,P)$ is purely numerical, so I use a numerical estimate of $J_{f}$ (using this code). $J_f$ is then an $N \times \texttt{length}(P)$ matrix (number of data points $\times$ number of fit parameters), evaluated at the best-fit parameters $P'$.
$$
J_{f} = \left[\begin{array}{c c c c}\frac{\partial f(P',x_1)}{\partial a_1} & \frac{\partial f(P',x_1)}{\partial a_2} & \dots & \frac{\partial f(P',x_1)}{\partial a_k}\\ \frac{\partial f(P',x_2)}{\partial a_1} & \frac{\partial f(P',x_2)}{\partial a_2} & \dots & \frac{\partial f(P',x_2)}{\partial a_k}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(P',x_N)}{\partial a_1} & \frac{\partial f(P',x_N)}{\partial a_2} & \dots & \frac{\partial f(P',x_N)}{\partial a_k}\end{array}\right]
$$
I then estimate the Hessian as
$$
H \approx J_{f}^TJ_{f}
$$
such that a matrix element of $H$ is
$$
H_{ij} = \sum_{n = 1}^{N}\left( \frac{\partial f(P',x_n)}{\partial a_i} \times \frac{\partial f(P',x_n)}{\partial a_j} \right)
$$
(side note: the Hessian obtained this way seems to agree with the numerical Hessian obtained using this code).
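For concreteness, here is a minimal MATLAB sketch of the kind of central-difference Jacobian described above. It is not the linked code; the helper `numjacobian` and the names `f`, `x`, `Pbest` are illustrative assumptions:

```matlab
function J = numjacobian(f, x, P, h)
% Central-difference Jacobian of the model f(x,P) with respect to the parameters P.
% f : function handle, f(x, P) returns an N-by-1 vector of model values
% x : N-by-1 vector of abscissae, P : k-by-1 parameter vector (here the best fit)
% h : relative step size, e.g. 1e-6
N = numel(x);
k = numel(P);
J = zeros(N, k);                       % N-by-k: rows = data points, columns = parameters
for j = 1:k
    dP      = zeros(k, 1);
    dP(j)   = h * max(abs(P(j)), 1);   % step scaled to the size of the j-th parameter
    J(:, j) = (f(x, P + dP) - f(x, P - dP)) / (2 * dP(j));
end
end
```

With `J = numjacobian(f, x, Pbest, 1e-6)`, the approximate Hessian above is simply `H = J.' * J`.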

$H$ is the observed Fisher information matrix (according to this). Moreover, the inverse of the Fisher information matrix is an estimator of the asymptotic covariance matrix, so
$$
C \approx \sigma_r H^{-1}
$$
where $\sigma_r$ is an unbiased estimate of the variance of the residuals.

The standard errors of $P$, scaled by the Student's $t$ critical value to give 95% confidence half-widths, are therefore
$$
P_{se} = \sqrt{\sigma_r\, \texttt{diag}(H^{-1})} \times \texttt{tinv}(1-0.05/2,v)
$$
where the last factor is $\approx$ 1.96 – the inverse of Student's t cumulative distribution function for a 95% confidence level – and $v = N - \texttt{length}(P)$ is the number of degrees of freedom. This assumes normally distributed errors in the fitted data.
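Putting the above together, a minimal MATLAB sketch of these steps (the names `y`, `yfit`, `Pbest` and the $N \times \texttt{length}(P)$ Jacobian `J` are assumed to come from the fit):

```matlab
% Assumed inputs: y = data, yfit = f(x, Pbest) at the best fit, Pbest = best-fit
% parameters P', J = N-by-length(P) numerical Jacobian evaluated at Pbest.
r      = y - yfit;                      % residuals at the best-fit parameters
v      = numel(y) - numel(Pbest);       % degrees of freedom, N - length(P)
s2     = sum(r.^2) / v;                 % unbiased estimate of the residual variance (sigma_r)
H      = J.' * J;                       % approximate Hessian
C      = s2 * inv(H);                   % approximate covariance matrix of the parameters
Pse    = sqrt(diag(C));                 % standard errors of the parameters
halfCI = Pse * tinv(1 - 0.05/2, v);     % 95% confidence half-widths (tinv: Statistics Toolbox)
CI     = [Pbest(:) - halfCI, Pbest(:) + halfCI];   % 95% confidence intervals
```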

Questions

  • Is this procedure at all valid?
  • When estimating the Hessian matrix should it in fact be
    $$
    H \approx \frac{1}{N}J_{f}^{T}J_{f}
    $$
    such that
    $$
    H_{ij} = \,{\bf\frac{1}{N}}\;\sum_{n = 1}^{N}\left( \frac{\partial f(P',x_n)}{\partial a_i} \times \frac{\partial f(P',x_n)}{\partial a_j} \right)
    $$
    ? This would make $H$ invariant with respect to $N$.
  • Instead of taking the Jacobian of $f(x,P)$ (which is a vector-valued function), should I use the Jacobian of the root-mean-square error ($\varepsilon$) of my fit?

Root-mean-square error function:
$$
\varepsilon(P,x,y)= \left(\frac{1}{N}\sum\limits_{i=1}^N \bigl(f(x_i,P)-y_i\bigr)^2\right)^{\frac{1}{2}}
$$
Jacobian:
$$
J_{\varepsilon} = \left[\begin{array}{c}\frac{\partial \varepsilon(P',x,y)}{\partial a_1} \\ \frac{\partial \varepsilon(P',x,y)}{\partial a_2}\\ \vdots\end{array}\right]
$$
Standard errors:
$$
H \approx J_{\varepsilon}^T J_{\varepsilon}
$$
etc.

Best Answer

If I am understanding the question properly, you are asking about the non-linear least squares (NLS) model. Sticking to your notation, the NLS model is: \begin{align} y_n &= f(x_n,P) + \epsilon_n \end{align} Various regularity conditions are required on $f$, $X$, and $\epsilon$. The parameters are $P$, a $k$-vector. They are estimated by solving: \begin{align} \min_P \sum_{n=1}^N \left(y_n-f(x_n,P) \right)^2 \end{align}

Assuming that you have found the global minimum and that it is interior to whatever the feasible set is for $P$ ($\mathbb{R}^k$, I guess?), the necessary first-order conditions (FOC) are: \begin{align} J_f^T\left(Y-f(X,P)\right)=0 \end{align} The estimator $P'$, defined as the interior solution to the minimization problem above and therefore solving the first-order condition immediately above, is consistent and asymptotically normal. Its asymptotic variance is $\sigma^2_rH^{-1}$. The variance of the error in the original model is assumed to be $Var(\epsilon_n)=\sigma^2_r$, and it may be consistently estimated as $\hat{\sigma}_r^2=\frac{1}{N-k}\sum_{n=1}^N\left(y_n-f(x_n,P')\right)^2$.

These are standard results for the non-linear least squares model. My favorite reference for them is Amemiya, T (1985) Advanced Econometrics, section 4.3.

The upshot of all this is that, if you want a 95% confidence interval for, say, the $3^{rd}$ element of $P$, you can use: \begin{align} P_{3}' \pm 1.96\sqrt{\hat{\sigma}_r^2\left(H^{-1}\right)_{33}} \end{align}
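For what it's worth, here is a short MATLAB sketch of that interval, assuming `y`, `x`, `f`, `Pbest` and the $N \times k$ Jacobian `J` from your question are in the workspace (names are illustrative):

```matlab
% Illustrative only; y, x, f, Pbest and J are assumed to exist as in the question.
res = y - f(x, Pbest);                              % residuals
s2  = sum(res.^2) / (numel(y) - numel(Pbest));      % sigma_r^2 hat = SSR / (N - k)
C   = s2 * inv(J.' * J);                            % asymptotic covariance estimate
ci3 = Pbest(3) + [-1, 1] * 1.96 * sqrt(C(3,3));     % 95% CI for the 3rd element of P
% With the Statistics Toolbox, nlparci computes the t-based equivalent:
% ci = nlparci(Pbest, res, 'jacobian', J);
```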

Notice that one of the regularity conditions is that all the variances of the $\epsilon$ are equal and that all of the covariances among elements of the $\epsilon$ are zero. The model in the link you provided contemplates that different $\epsilon_n$ have different variances (i.e. heteroskedasticity).

If that characterizes your application, then you need to modify your variance matrix. You can use a so-called sandwich estimator, like this: \begin{align} C &= \left( J_f^T J_f \right)^{-1}J_f^T \hat{\Sigma} J_f \left( J_f^T J_f \right)^{-1}\\ \hat{\Sigma} &= diag \left(\left(y_n-f(x_n,P') \right)^2 \right) \end{align} Then, the 95% confidence interval for the $3^{rd}$ element of $P$, would be: \begin{align} P_{3}' \pm 1.96\sqrt{C_{33}} \end{align} Further modifications would be required if there were to be correlations among the various elements of $\epsilon$.
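A corresponding sketch of the sandwich covariance, with the same assumed names as above:

```matlab
% Heteroskedasticity-robust ("sandwich") covariance; res, J, Pbest as in the previous sketch.
Sigma = diag(res.^2);                               % diag((y_n - f(x_n, P'))^2)
B     = inv(J.' * J);                               % the "bread"
C     = B * (J.' * Sigma * J) * B;                  % sandwich covariance estimate
ci3   = Pbest(3) + [-1, 1] * 1.96 * sqrt(C(3,3));   % robust 95% CI for the 3rd element of P
```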
