Linear Regression – Best Polynomial or Alternative Approaches

Tags: curve-fitting, nonlinear-regression, regression

Any ideas on other polynomials I could successfully use for regression? My goal is a solution whose fit error is driven strictly by the noise. Is this possible, given that the data follow a bell-like curve? The tails can be long, short, or non-existent. Am I looking for the impossible? Please note it has been a while since I have plunged this far into linear regression.

Anyway, my data points do not fit the $ax+bx^2+c$ polynomial well enough, so I would like a replacement for this polynomial (or a different approach altogether) for the regression. In the example below, each point misses the fitted curve by about $20\%$ of the $Y$ range of the data, on average. I looked at splines, but I do not know how to apply regression to them, since they are piecewise.

Example $Y$ values (evenly spaced in $X$) are below; they look roughly like a bell curve. I would be using 5 to 50 data points to determine the coefficients. Ultimately, I am interested in the best $X$ location for the peak, based on the data points.

 X      Y
-40    -21142.1111111111
-30    -21330.1111111111
-20    -12036.1111111111
-10      7255.3888888889
  0     32474.8888888889
 10     32474.8888888889
 20      9060.8888888889
 30    -11628.1111111111
 40    -15129.6111111111

Best Answer

Because brightness is a response with independent random error and it is expected to taper off with distance from the optimal point according to a Gaussian function, a quick nonlinear regression ought to do a good job.

The model is

$$y = b + a \exp\left(-\frac{1}{2}\left(\frac{x-m}{s}\right)^2\right) + \varepsilon$$

where $\varepsilon$ represents the errors in measuring the brightness, modeled here as random quantities. The peak occurs at $m$; $s\gt 0$ quantifies the rate at which the curve tapers off; $a\gt 0$ reflects the overall magnitudes of the relative $y$ values, and $b$ is a baseline.
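One step worth making explicit: for $a \gt 0$, the noiseless part of this model attains its maximum exactly at $x = m$, because its derivative

$$\frac{d}{dx}\left[b + a \exp\left(-\frac{1}{2}\left(\frac{x-m}{s}\right)^2\right)\right] = -\frac{a\,(x-m)}{s^2} \exp\left(-\frac{1}{2}\left(\frac{x-m}{s}\right)^2\right)$$

vanishes only at $x = m$, where it changes sign from positive to negative. Estimating $m$ therefore answers the question about the peak location directly.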

Let's try it with the sample data (using R). By including the peak location ($m$) among the parameters, the software will automatically output its estimate and a standard error for it:

# The example data: y entered as exact ninths, x evenly spaced.
y <- c(-190279, -191971, -108325, 65298, 292274, 292274, 81548, -104653, -136166)/9
x <- (-4:4)*10
#
# Define a Gaussian function (of four parameters).
f <- function(x, theta)  { 
  m <- theta[1]; s <- theta[2]; a <- theta[3]; b <- theta[4];
  a*exp(-0.5*((x-m)/s)^2) + b
}
#
# Estimate some starting values.
m.0 <- x[which.max(y)]; s.0 <- (max(x)-min(x))/4; b.0 <- min(y); a.0 <- (max(y)-min(y))
#
# Do the fit.  (It takes no time at all.)
fit <- nls(y ~ f(x,c(m,s,a,b)), data.frame(x,y), start=list(m=m.0, s=s.0, a=a.0, b=b.0))
#
# Display the estimated location of the peak and its SE.
summary(fit)$parameters["m", 1:2]

The tricky part with nonlinear fits usually is finding good starting values for the parameters: the code above shows one (crude) approach.
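If that crude initialization ever fails to converge, here is a minimal sketch of a more systematic alternative (my construction in base R, not part of the original answer): with $m$ and $s$ held fixed, the model is linear in $a$ and $b$, so we can grid-search over $(m, s)$ and let ordinary least squares handle the linear part.

# Grid-search starting values: for each candidate (m, s), fit a and b
# linearly and record the residual sum of squares.
init.grid <- expand.grid(m=seq(min(x), max(x), length.out=21),
                         s=seq(diff(range(x))/20, diff(range(x)), length.out=21))
rss <- apply(init.grid, 1, function(p) {
  g <- exp(-0.5*((x - p["m"])/p["s"])^2)
  sum(resid(lm(y ~ g))^2)              # best a, b for this (m, s)
})
best <- init.grid[which.min(rss), ]    # (m, s) with the smallest RSS
g <- exp(-0.5*((x - best$m)/best$s)^2)
ab <- coef(lm(y ~ g))                  # intercept = b, slope = a
start.robust <- list(m=best$m, s=best$s, a=unname(ab["g"]), b=unname(ab["(Intercept)"]))

Passing start.robust as the start argument of nls would work the same way as the crude values above. Returning to the fit, its output,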

  Estimate Std. Error 
 5.3161940  0.4303487 

gives the estimate of the peak ($5.32$) and its standard error ($0.43$).
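As a quick sanity check (my addition, not in the original answer), the usual two-standard-error rule turns this output into a rough $95\%$ interval for the peak location:

est <- summary(fit)$parameters["m", 1:2]
est[1] + c(-2, 2) * est[2]   # roughly 4.46 to 6.18

It's always a good idea to plot the fit and compare it to the data: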

# Set up axes wide enough to include the fitted peak (via a dummy point
# at x=0), then overlay the fitted curve and the data.
par(mfrow=c(1,1))
plot(c(x,0),c(y,f(coef(fit)["m"],coef(fit))), main="Data", type="n",
     xlab="x", ylab="Brightness")
curve(f(x, coef(fit)), add=TRUE, col="Red", lwd=2)
points(x,y, pch=19)

[Figure: the data with the fitted Gaussian curve]

That's what we expected: the data appear to fit the Gaussian pretty well. For a more incisive look at the fit, plot the residuals:

plot(x, resid(fit), main="Residuals")

[Figure: residuals of the fit]

You want to check that most residuals are as small as the (known?) variation in the brightness measurement and that there is no important trend or pattern in them. We might be a little concerned about the high residual at $x=40$, but rerunning the procedure with this last data point removed (sketched below) does not appreciably change the estimate of $m$: it is now $5.25$ with a standard error of $0.17$, not distinguishable from the previous estimate. The new residuals bounce up and down and tend to get smaller as $x$ gets larger, but otherwise stay below about $1000$ in absolute value: there's no sign here that more effort is needed to pin down $m$.
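For reference, that refit is a one-liner once f and fit are in place; here is a sketch, reusing the full fit's estimates as starting values:

# Refit without the suspect last point, starting from the previous estimates.
keep <- x != 40
fit2 <- nls(y ~ f(x, c(m,s,a,b)), data.frame(x=x[keep], y=y[keep]),
            start=as.list(coef(fit)))
summary(fit2)$parameters["m", 1:2]   # about 5.25 with a standard error of 0.17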