Solved – How to estimate confidence intervals of least-squares fit parameters by means of a numerical Jacobian

confidence interval, correlation, curve fitting, estimation, regression

I am working on a complicated data fitting algorithm in Matlab. I have a problem with properly estimating the confidence intervals of my fit. I will describe my procedure in some detail, give some of my thoughts on the problem and subsequently formulate my question more precisely.

Disclaimer: I am a physicist, not a statistician. Please respond in technical, but not discipline-specific language if possible.

What do I do?

I have a complicated data set and an even more complicated model for it which I don't want to describe here since that is not relevant. I do a least squares fit of the data set with respect to $P$ – a vector of fit parameters (I am not using maximum likelihood estimation yet). This part works very well and I've tested it extensively on known (both real and artificial) data-sets.

I subsequently aim to follow this procedure in order to estimate the prediction bounds on my fit parameters. From now on I will use the same variable nomenclature so I recommend reading the procedure now.

Problem

My fit function $f(x,P)$ is purely numerical, so I use a numerical estimate of $J_{f}$ (using this code). $J_f$ is then an $N \times \texttt{length}(P)$ matrix (number of data points $\times$ number of fit parameters), evaluated at the best-fit parameters $P'$.
$$
J_{f} = \left[\begin{array}{c c c c}\frac{\partial f(P',x_1)}{\partial a_1} & \frac{\partial f(P',x_1)}{\partial a_2} & \dots & \frac{\partial f(P',x_1)}{\partial a_k}\\ \frac{\partial f(P',x_2)}{\partial a_1} & \frac{\partial f(P',x_2)}{\partial a_2} & \dots & \frac{\partial f(P',x_2)}{\partial a_k}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(P',x_N)}{\partial a_1} & \frac{\partial f(P',x_N)}{\partial a_2} & \dots & \frac{\partial f(P',x_N)}{\partial a_k}\end{array}\right]
$$
I then estimate the Hessian as
$$
H \approx J_{f}^TJ_{f}
$$
such that a matrix element of $H$ is
$$
H_{ij} = \sum_{n = 1}^{N}\left( \frac{\partial f(P',x_n)}{\partial a_i} \times \frac{\partial f(P',x_n)}{\partial a_j} \right)
$$
(side note: the Hessian obtained this way seems to agree with the numerical Hessian obtained using this code).
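For concreteness, here is a minimal MATLAB sketch of the kind of central-difference Jacobian described above. It is not the linked code; the helper `numjacobian` and the names `f`, `x`, `Pbest` are illustrative assumptions:

```matlab
function J = numjacobian(f, x, P, h)
% Central-difference Jacobian of the model f(x,P) with respect to the parameters P.
% f : function handle, f(x, P) returns an N-by-1 vector of model values
% x : N-by-1 vector of abscissae, P : k-by-1 parameter vector (here the best fit)
% h : relative step size, e.g. 1e-6
N = numel(x);
k = numel(P);
J = zeros(N, k);                       % N-by-k: rows = data points, columns = parameters
for j = 1:k
    dP      = zeros(k, 1);
    dP(j)   = h * max(abs(P(j)), 1);   % step scaled to the size of the j-th parameter
    J(:, j) = (f(x, P + dP) - f(x, P - dP)) / (2 * dP(j));
end
end
```

With `J = numjacobian(f, x, Pbest, 1e-6)`, the approximate Hessian above is simply `H = J.' * J`.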

$H$ is the observed Fisher information matrix (according to this). Moreover, the inverse of the Fisher information matrix is an estimator of the asymptotic covariance matrix, so
$$
C \approx \sigma_r H^{-1}
$$
where $\sigma_r$ is an unbiased estimate of the variance of the residuals.

The standard errors of $P$, scaled by the Student's $t$ critical value to give 95% confidence half-widths, are therefore
$$
P_{se} = \sqrt{\sigma_r\, \texttt{diag}(H^{-1})} \times \texttt{tinv}(1-0.05/2,v)
$$
where the last factor is $\approx$ 1.96 – the inverse of Student's t cumulative distribution function for a 95% confidence level – and $v = N - \texttt{length}(P)$ is the number of degrees of freedom. This assumes normally distributed errors in the fitted data.
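Putting the above together, a minimal MATLAB sketch of these steps (the names `y`, `yfit`, `Pbest` and the $N \times \texttt{length}(P)$ Jacobian `J` are assumed to come from the fit):

```matlab
% Assumed inputs: y = data, yfit = f(x, Pbest) at the best fit, Pbest = best-fit
% parameters P', J = N-by-length(P) numerical Jacobian evaluated at Pbest.
r      = y - yfit;                      % residuals at the best-fit parameters
v      = numel(y) - numel(Pbest);       % degrees of freedom, N - length(P)
s2     = sum(r.^2) / v;                 % unbiased estimate of the residual variance (sigma_r)
H      = J.' * J;                       % approximate Hessian
C      = s2 * inv(H);                   % approximate covariance matrix of the parameters
Pse    = sqrt(diag(C));                 % standard errors of the parameters
halfCI = Pse * tinv(1 - 0.05/2, v);     % 95% confidence half-widths (tinv: Statistics Toolbox)
CI     = [Pbest(:) - halfCI, Pbest(:) + halfCI];   % 95% confidence intervals
```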

Questions

  • Is this procedure at all valid?
  • When estimating the Hessian matrix should it in fact be
    $$
    H \approx \frac{1}{N}J_{f}^{T}J_{f}
    $$
    such that
    $$
    H_{ij} = \,{\bf\frac{1}{N}}\;\sum_{n = 1}^{N}\left( \frac{\partial f(P',x_n)}{\partial a_i} \times \frac{\partial f(P',x_n)}{\partial a_j} \right)
    $$
    ? This would make $H$ invariant with respect to $N$.
  • Instead of taking the Jacobian of $f(x,P)$ (which is a vector-valued function), should I use the Jacobian of the root-mean-square error ($\varepsilon$) of my fit?

Root-mean-square error function:
$$
\varepsilon(P,x,y)= \left(\frac{1}{N}\sum\limits_{i=1}^N \bigl(f(x_i,P)-y_i\bigr)^2\right)^{\frac{1}{2}}
$$
Jacobian:
$$
J_{\varepsilon} = \left[\begin{array}{c}\frac{\partial \varepsilon(P',x,y)}{\partial a_1} \\ \frac{\partial \varepsilon(P',x,y)}{\partial a_2}\\ \vdots\end{array}\right]
$$
Standard errors:
$$
H \approx J_{\varepsilon}^T J_{\varepsilon}
$$
etc.

Best Answer

If I am understanding the question properly, you are asking about the non-linear least squares (NLS) model. Sticking to your notation, the NLS model is: \begin{align} y_n &= f(x_n,P) + \epsilon_n \end{align} Various regularity conditions are required on $f$, $X$, and $\epsilon$. The parameters are $P$, a $k$-vector. They are estimated by solving: \begin{align} \min_P \sum_{n=1}^N \left(y_n-f(x_n,P) \right)^2 \end{align}

Assuming that you have found the global minimum and that it is interior to whatever the feasible set is for $P$ ($\mathbb{R}^k$, I guess?), the necessary first-order conditions (FOC) are: \begin{align} J_f^T\left(Y-f(X,P)\right)=0 \end{align} The estimator $P'$, defined as the interior solution to the minimization problem above and therefore solving the first-order condition immediately above, is consistent and asymptotically normal. Its asymptotic variance is $\sigma^2_rH^{-1}$. The variance of the error in the original model is assumed to be $Var(\epsilon_n)=\sigma^2_r$, and it may be consistently estimated as $\hat{\sigma}_r^2=\frac{1}{N-k}\sum_{n=1}^N\left(y_n-f(x_n,P')\right)^2$.

These are standard results for the non-linear least squares model. My favorite reference for them is Amemiya, T (1985) Advanced Econometrics, section 4.3.

The upshot of all this is that, if you want a 95% confidence interval for, say, the $3^{rd}$ element of $P$, you can use: \begin{align} P_{3}' \pm 1.96\sqrt{\hat{\sigma}_r^2\left(H^{-1}\right)_{33}} \end{align}
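For what it's worth, here is a short MATLAB sketch of that interval, assuming `y`, `x`, `f`, `Pbest` and the $N \times k$ Jacobian `J` from your question are in the workspace (names are illustrative):

```matlab
% Illustrative only; y, x, f, Pbest and J are assumed to exist as in the question.
res = y - f(x, Pbest);                              % residuals
s2  = sum(res.^2) / (numel(y) - numel(Pbest));      % sigma_r^2 hat = SSR / (N - k)
C   = s2 * inv(J.' * J);                            % asymptotic covariance estimate
ci3 = Pbest(3) + [-1, 1] * 1.96 * sqrt(C(3,3));     % 95% CI for the 3rd element of P
% With the Statistics Toolbox, nlparci computes the t-based equivalent:
% ci = nlparci(Pbest, res, 'jacobian', J);
```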

Notice that one of the regularity conditions is that all the variances of the $\epsilon$ are equal and that all of the covariances among elements of the $\epsilon$ are zero. The model in the link you provided contemplates that different $\epsilon_n$ have different variances (i.e. heteroskedasticity).

If that characterizes your application, then you need to modify your variance matrix. You can use a so-called sandwich estimator, like this: \begin{align} C &= \left( J_f^T J_f \right)^{-1}J_f^T \hat{\Sigma} J_f \left( J_f^T J_f \right)^{-1}\\ \hat{\Sigma} &= diag \left(\left(y_n-f(x_n,P') \right)^2 \right) \end{align} Then, the 95% confidence interval for the $3^{rd}$ element of $P$, would be: \begin{align} P_{3}' \pm 1.96\sqrt{C_{33}} \end{align} Further modifications would be required if there were to be correlations among the various elements of $\epsilon$.
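A corresponding sketch of the sandwich covariance, with the same assumed names as above:

```matlab
% Heteroskedasticity-robust ("sandwich") covariance; res, J, Pbest as in the previous sketch.
Sigma = diag(res.^2);                               % diag((y_n - f(x_n, P'))^2)
B     = inv(J.' * J);                               % the "bread"
C     = B * (J.' * Sigma * J) * B;                  % sandwich covariance estimate
ci3   = Pbest(3) + [-1, 1] * 1.96 * sqrt(C(3,3));   % robust 95% CI for the 3rd element of P
```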
