Solved – Interpretation of weights in non-linear least squares regression

least-squares, nonlinear-regression, references, scipy, weighted-regression

I am conducting a non-linear least squares fit using the Python function scipy.optimize.curve_fit, and I am trying to better understand the weights that go into this method.

I have a distribution of raw data points that I wish to fit to a Gaussian cumulative distribution function. I created a function for this that takes three parameters: an average, a standard deviation, and a scale factor in case my distribution doesn't quite reach 1.

My confidence in each raw data point is based on a separate instrumental count that doesn't necessarily have anything to do with the value of the data point, so I'm trying to include this in my fit using weights. Specifically, small and large values of x have less certainty, so I want them to matter less in the regression. When I conduct a fit by passing these raw counts as weights, the fit is not particularly good, whereas if I pass them as (1/counts) the fit improves. I have plotted the raw data, fits, and normalized weights for these two options.
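
Roughly, the setup looks like the sketch below. The gaussian_cdf model function, the synthetic data, and the starting values are only illustrative, not my actual code; the weights are passed through curve_fit's sigma argument, which is the only per-point weighting mechanism curve_fit offers.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def gaussian_cdf(x, mu, sd, scale):
    # Gaussian CDF with mean mu and standard deviation sd, times a scale
    # factor for distributions that don't quite reach 1.
    return scale * norm.cdf(x, loc=mu, scale=sd)

# Synthetic stand-in for the raw data and the per-point instrument counts.
rng = np.random.default_rng(1)
x = np.linspace(20, 500, 30)
counts = rng.integers(1, 90, size=x.size)
y = gaussian_cdf(x, 150.0, 60.0, 0.9) + rng.normal(0.0, 0.05 / np.sqrt(counts))

p0 = [np.median(x), np.std(x), y.max()]

# curve_fit's sigma is a per-point *uncertainty*: each residual is divided by
# it, so a small sigma means more influence on the fit, not less.
fit_counts, _  = curve_fit(gaussian_cdf, x, y, p0=p0, sigma=counts)        # counts as "weights"
fit_inverse, _ = curve_fit(gaussian_cdf, x, y, p0=p0, sigma=1.0 / counts)  # 1/counts as "weights"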

What I am trying to understand is how to interpret the weights. I would have thought that higher values for the weights imply more importance in the regression. However, it seems that it is actually correct to use the weights where the "bad" raw data points have a higher weight. Why is this and how should I be interpreting the weights? Also, is there a better resource for understanding weights in non-linear regression? Most of the sources I have found have not explained weighting in a way I can comprehend.

Edit: I added a second plot that shows the actual (non-normalized) counts, along with the corrected fit (weighted according to the counts) obtained with the proper fitting technique shown below.

[Figure 1: raw data, the two weighted fits, and the normalized weights]

[Figure 2: raw data with actual (non-normalized) counts and the corrected count-weighted fit]

Best Answer

The weights should equal the counts: in least squares the weights should be inversely proportional to the error variances, and here the counts are inversely proportional to the variances of the errors. Specifically, the model for the data $(x_i, y_i, n_i)$ is

$$y_i = \lambda\, \Phi\big((\log(x_i) - \mu)/\sigma\big) + \varepsilon_i$$

with parameters $\mu$, $\sigma \gt 0$, and $\lambda \gt 0$, where the $\varepsilon_i$ are independent random variables with zero means and variances

$$\text{Var}(\varepsilon_i) = \tau^2 / n_i,$$

where the $n_i$ are the counts and $\tau^2$ is a common (unknown) error variance, distinct from the $\sigma$ inside the CDF.
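
For the asker's scipy workflow, this variance model amounts to giving each point an error standard deviation proportional to $1/\sqrt{n_i}$, which is exactly what curve_fit's sigma argument expects (each residual is divided by its sigma, so the squared residuals are weighted by $n_i$). A minimal sketch along those lines, reusing the data from the R code further down and fitting against $\log(x)$ as in the model above (the function name model and the starting values are illustrative):

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Data transcribed from the R code below.
y = np.array([1, 1, 2, 1, 2, 1, 3, 4, 22, 30, 44, 58, 68, 69,
              71, 72, 75, 72, 80, 78, 87, 86, 80, 82, 92, 90, 85, 61, 38, 36]) / 100
x = np.ceil(np.exp(np.linspace(np.log(20), np.log(500), y.size)))
counts = np.array([10, 3, 17, 20, 38, 31, 44, 55, 58, 68, 77,
                   82, 86, 82, 77, 75, 70, 65, 68, 51, 47, 41, 38, 30, 22, 14, 9, 4, 2, 1])

def model(log_x, mu, sigma, lam):
    # lambda * Phi((log x - mu) / sigma), the model written out above.
    return lam * norm.cdf(log_x, loc=mu, scale=sigma)

xi = np.log(x)
p0 = [np.median(xi), np.std(xi), y.max()]

# Weight proportional to n_i  <=>  error s.d. proportional to 1/sqrt(n_i).
popt, pcov = curve_fit(model, xi, y, p0=p0, sigma=1.0 / np.sqrt(counts))
print(popt)  # fitted (mu, sigma, lambda)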

The fit to the logarithm of $x$ is visually ok:

[Figure: count-weighted least-squares fit to the data on a logarithmic x-axis]

In this figure the x-axis is on a logarithmic scale, the point symbols have areas proportional to the counts (so larger circles have more influence on the fit than smaller ones), and the red curve is the count-weighted least-squares fit. Even so, the model is not entirely appropriate: the residuals for smaller values of $y$ tend to be small regardless of the counts. Possibly the sum of squares of relative errors should be minimized to obtain an appropriate fit (a scipy sketch of this follows the next paragraph).

It is evident that the fit is poor for the largest $x$, but those also have small counts.
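
One possible way to act on the relative-error remark, continuing the scipy sketch above (same model, xi, y, counts, and p0), is to scale each point's sigma by its observed $y_i$ as well as by $1/\sqrt{n_i}$; curve_fit then minimizes a count-weighted sum of squared relative errors. This is only one reading of that suggestion.

# Count-weighted relative errors: each residual is divided by y_i in addition
# to being weighted by sqrt(n_i).
popt_rel, pcov_rel = curve_fit(model, xi, y, p0=p0, sigma=y / np.sqrt(counts))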


The R code with (my version of) the data and the fitting and plotting procedures follows.

y <- c(1, 1, 2, 1, 2, 1, 3, 4, 22, 30, 44, 58, 68, 69, 
       71, 72, 75, 72, 80, 78, 87, 86, 80, 82, 92, 90, 85, 61, 38, 36) / 100
x <- ceiling(exp(seq(log(20), log(500), length.out=length(y))))
counts <- c( 10, 3, 17, 20, 38, 31, 44, 55, 58, 68, 77, 
             82, 86, 82, 77, 75, 70, 65, 68, 51, 47, 41, 38, 30, 22, 14, 9, 4, 2, 1)
#
# The least-squares criterion.
# theta[1] is a location, theta[2] an x-scale, and theta[3] a y-scale.
#
f <- function(theta, x, y, n) 
  sum(n * (y - pnorm(x, theta[1], theta[2]) * theta[3])^2) / sum(n)
#
# Perform a count-weighted least-squares fit.
#
xi <- log(x)
fit <- optim(c(median(xi), sd(xi), max(y) * sd(xi)), f, x=xi, y=y, n=counts)
#
# Plot the result.
#
par(mfrow=c(1,1))
plot(x, y, log="x", pch=19, col="Gray", cex=sqrt(counts/12)) # symbol areas ~ counts
points(x, y, cex=sqrt(counts/10))                            # open circles outlining the disks
curve(fit$par[3] * pnorm(log(x), fit$par[1], fit$par[2]), 
          from=10, to=1000, col="Red", add=TRUE)