Solved – Appropriate goodness of fit measure for signal with unknown errors

chi-squared-test, curve-fitting, goodness-of-fit, scipy

I have a signal (voltage vs. time) from a measurement device. The device outputs exactly one data point every constant time interval $dt$. Theoretical reasons lead me to suppose that the data follow a hyperbolic curve $f(x)= \frac{mx}{k+x}$. Using scipy I fitted this curve to the data with the curve_fit function (which does a least-squares fit).
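For reference, a minimal sketch of such a fit (the data here are made up and the starting values are my own guesses) might look like:

```python
import numpy as np
from scipy.optimize import curve_fit

def hyperbolic(x, m, k):
    """Model f(x) = m*x / (k + x)."""
    return m * x / (k + x)

# Placeholder data: replace t and v with the measured time and voltage arrays.
dt = 0.01
t = np.arange(dt, 2.0, dt)
rng = np.random.default_rng(0)
v = hyperbolic(t, 1.0, 0.3) + rng.normal(0.0, 0.02, t.size)

# Least-squares fit; p0 is an initial guess for (m, k).
popt, pcov = curve_fit(hyperbolic, t, v, p0=(1.0, 0.5))
m_hat, k_hat = popt
```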

I am asking myself what would be an appropriate goodness-of-fit measure if I don't know the errors of the data. Additionally, is it possible to get an estimate of the parameter uncertainty?

I first thought about a chi-squared test (see my question Chi squared test for goodness of fit), but for that the errors (at least in $y$) need to be known.

[Figure: measured voltage vs. time with the fitted hyperbolic curve]

Best Answer

If you estimate the curve parameters (i.e. $m$ and $k$) using least squares, then you are implicitly using the root-mean-squared error as the misfit metric (i.e. objective to be minimized).

In general for a regression problem you hypothesize a model of the form $$y=f_\theta(x)+\epsilon$$ where $(x,y)$ is the observed data, $f_\theta$ is a function depending on unknown parameters $\theta$, and $\epsilon$ is an unknown pointwise error (with expected value zero). Generally the parameters $\theta$ are estimated by minimizing some function of the residuals, $r=y-f_\theta(x)$.

In the case of least squares, this is $E(r)=\overline{r^2}$, the mean squared residual; its square root, the RMSE, is therefore an estimate of the standard deviation of the error term (computed over the sampled residuals). If the errors $\epsilon$ are normally distributed, then least squares is the maximum-likelihood estimator of the parameters.
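As a concrete sketch (assuming the arrays t and v, the model function, and the fitted parameters popt from the question's curve_fit call), the RMSE is computed directly from the residuals:

```python
import numpy as np

# Residuals of the fitted model.
r = v - hyperbolic(t, *popt)

# Mean squared residual and its square root (the RMSE);
# the RMSE estimates the standard deviation of the error term.
mse = np.mean(r**2)
rmse = np.sqrt(mse)
```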

In your case, the errors appear to have some outliers (around $t=0.5$), which will inflate the error-scale estimate (and possibly bias the parameter estimate). You could mitigate this by using a robust estimate for the residual-dispersion (either as a post-processing, or as part of a robust regression scheme). In any case, while a single-number summary is convenient, it is always good to also inspect the residual distribution (both vs. time, and the bulk PDF).
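A sketch of both ideas, using the residuals r from above (the MAD-based scale is one common robust choice; the plotting details are arbitrary):

```python
import matplotlib.pyplot as plt
from scipy.stats import median_abs_deviation

# Robust estimate of the residual dispersion: the median absolute deviation,
# scaled to be comparable to a standard deviation for normal errors.
robust_scale = median_abs_deviation(r, scale="normal")

# Inspect the residuals: vs. time, and their bulk distribution.
fig, (ax_t, ax_h) = plt.subplots(1, 2, figsize=(8, 3))
ax_t.plot(t, r, ".")
ax_t.set_xlabel("t")
ax_t.set_ylabel("residual")
ax_h.hist(r, bins=30)
ax_h.set_xlabel("residual")
plt.show()
```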

For parameter uncertainty, a simple approach would be bootstrapping. This has the added benefit of revealing the impact of any outliers, since each bootstrap resample omits a fraction of the data points and will therefore sometimes exclude them.
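A minimal sketch of a nonparametric (case-resampling) bootstrap, again assuming t, v, hyperbolic, and popt from the original fit (the number of resamples is arbitrary):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
n_boot = 1000
boot_params = []
for _ in range(n_boot):
    idx = rng.integers(0, t.size, t.size)   # resample data points with replacement
    try:
        p, _ = curve_fit(hyperbolic, t[idx], v[idx], p0=popt)
        boot_params.append(p)
    except RuntimeError:
        pass                                 # skip resamples where the fit fails
boot_params = np.array(boot_params)

# Bootstrap standard errors and 95% percentile intervals for (m, k).
param_se = boot_params.std(axis=0)
param_ci = np.percentile(boot_params, [2.5, 97.5], axis=0)
```

The spread of boot_params across resamples gives the parameter uncertainty; resamples that happen to omit the outlying points show how strongly those points pull the estimates.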
