Solved – How to interpret the covariance matrix from a curve fit

covariance-matrix, curve-fitting, model-selection, python, variance

I'm not too great at statistics, so apologies if this is a simplistic question. I am fitting a curve to some data; sometimes my data best fits a negative exponential of the form $a e^{-bx} + c$, and sometimes the fit is closer to $a e^{-b x^2} + c$. However, sometimes both of those fail, and I would like to fall back to a linear fit. My question is: how can I determine which model best fits a particular data set from the variance-covariance matrix returned by the scipy.optimize.curve_fit() function? I believe the parameter variances are on the diagonal of this matrix, but I'm not sure how to interpret them.

UPDATE: Based on a similar question, I'm hoping the variance-covariance matrix can tell me which of the three models best fits the data (I am trying to fit many datasets against these three models).

The resulting matrices look like this for the given example:

pcov_lin 
[[  2.02186921e-05  -2.02186920e-04]
 [ -2.02186920e-04   2.76322124e-03]]
pcov_exp
[[  9.05390292e+00  -7.76201283e-02  -9.20475334e+00]
 [ -7.76201283e-02   6.69727245e-04   7.90218415e-02]
 [ -9.20475334e+00   7.90218415e-02   9.36160310e+00]]
pcov_exp_2 
[[  1.38338049e-03  -7.39204594e-07  -7.81208814e-04]
 [ -7.39204594e-07   8.99295434e-09   1.92970700e-06]
 [ -7.81208814e-04   1.92970700e-06   9.14746758e-04]]

Here is an example of what I am doing:

import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize

def exp_func(x, a, b, c):
    # negative exponential: a * e^(-b*x) + c
    return a * np.exp(-b * x) + c

def exp_squared_func(x, a, b, c):
    # exponential in x squared: a * e^(-b*x^2) + c
    return a * np.exp(-b * x * x) + c

def linear_func(x, a, b):
    # straight line: a*x + b
    return a * x + b

def main():
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], float)
    y = np.array([1, 1, 1, 1, 0.805621, 0.798992, 0.84231, 0.728796, 0.819471, 0.570414, 0.355124, 0.276447, 0.159058, 0.0762189, 0.0167807, 0.0118647, 0.000319948, 0.00118267, 0, 0, 0], float)

    # initial guess (a, b, c) for the exponential fits
    p0 = [0.7746042467213462, 0.10347274384077858, -0.016253458007293588]
    popt_lin, pcov_lin     = scipy.optimize.curve_fit(linear_func, x, y)
    popt_exp, pcov_exp     = scipy.optimize.curve_fit(exp_func, x, y, p0=p0)
    popt_exp_2, pcov_exp_2 = scipy.optimize.curve_fit(exp_squared_func, x, y, p0=p0)

    plt.figure()
    plt.plot(x, y, 'ko', label="Original data")
    plt.plot(x, linear_func(x, *popt_lin), 'r-', label='linear')
    plt.plot(x, exp_func(x, *popt_exp), 'b-', label='exponential')
    plt.plot(x, exp_squared_func(x, *popt_exp_2), 'g-', label='exponential squared')
    plt.legend()
    plt.show()

if __name__ == '__main__':
    main()

Best Answer

As a clarification: the variable pcov returned by scipy.optimize.curve_fit is the estimated covariance of the parameter estimates. Loosely speaking, given the data and a model, it tells you how much information the data contain to determine the value of each parameter in that model. It does not tell you whether the chosen model is good or not. See also this.
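
The most common practical use of pcov is to read off per-parameter uncertainties: the square roots of its diagonal entries are one-standard-deviation errors on the fitted parameters. A minimal sketch, reusing pcov_exp from the question:

import numpy as np

# one-standard-deviation uncertainties of a, b, c for the exponential fit
perr_exp = np.sqrt(np.diag(pcov_exp))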

The question of what makes a good model is indeed a hard problem. As the statistician George Box put it,

All models are wrong, but some are useful

So the criterion to use when comparing different models depends on what you want to achieve.

For instance, if you want a curve that is as close as possible to the data, you could select the model with the smallest residual. In your case, that would be the model func and estimated parameters popt that give the lowest value of

numpy.linalg.norm(y-func(x, *popt))
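
Applied to the three fits in the question, a minimal sketch (assuming x, y, the model functions, and the fitted popt_* arrays from the code above are in scope):

import numpy as np

# residual norm of each fitted model; smaller means a closer fit to the data
fits = {
    'linear':              (linear_func, popt_lin),
    'exponential':         (exp_func, popt_exp),
    'exponential squared': (exp_squared_func, popt_exp_2),
}
residuals = {name: np.linalg.norm(y - func(x, *popt))
             for name, (func, popt) in fits.items()}
best_model = min(residuals, key=residuals.get)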

However, if you select a model with more parameters, the residual will automatically decrease, at the cost of higher model complexity. So the choice ultimately comes back to what the goal of the model is.
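
If you want a criterion that also penalizes that extra complexity, one common choice is the Akaike information criterion (AIC). This goes beyond the answer above; a minimal sketch under the usual least-squares assumption of Gaussian errors, where AIC = n * ln(RSS/n) + 2k for n data points, residual sum of squares RSS, and k fitted parameters:

import numpy as np

def aic_least_squares(y, y_fit, n_params):
    # n * ln(RSS / n) + 2k; the model with the lowest AIC is preferred
    rss = np.sum((y - y_fit) ** 2)
    n = len(y)
    return n * np.log(rss / n) + 2 * n_params

aic_lin = aic_least_squares(y, linear_func(x, *popt_lin), 2)
aic_exp = aic_least_squares(y, exp_func(x, *popt_exp), 3)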