Nonlinear Regression – Consequences of Violating Assumptions When Comparing Models

assumptions, heteroscedasticity, model comparison, nonlinear regression, normality-assumption

I have a question about the consequences of using non-linear regression when the data violate the assumptions of (1) homoscedasticity and (2) normal distribution. Specifically, I am wondering about how it affects model comparison and the comparison of two data sets with one model. That is, how does it affect the validity and/or strength of those comparisons?

To give the question a little context, there are many instances in agronomy (my particular field) where the data are always heteroscedastic and non-normal, especially when they approach some sort of natural limit. It is nevertheless very common to see nonlinear regressions fit through these kinds of data, followed by model and/or data set comparisons.

Here is an imaginary example where weed biomass influences crop yield loss.
[Figure: imaginary example, crop yield loss (%) plotted against weed biomass (g m^-2)]

Issues of heteroscedasticity. The spread of the response is clearly and systematically related to the predictor value. At high weed biomass, the crop yield loss values are far less variable than they are at low weed biomass.

Issues of non-normality. Additionally, at the higher levels of weed biomass, where the crop yield loss approaches but cannot exceed 100%, the data are not normally distributed. Rather, they follow something like a truncated normal distribution, where half of the distribution is missing (in this case, the half that would go above 100%).

So, what does this mean for further analysis in terms of model and/or data set comparison?

The former is often done in agronomy to determine the better model for prediction purposes (e.g., which model is better for predicting crop response?). The latter is often done to evaluate the effect of certain treatments on a relationship (e.g., does fertilization affect the relationship between weed biomass and crop yield loss?).
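For concreteness, the data set comparison is commonly run as a nested-model F-test: one fit with parameters shared across treatments against one with treatment-specific parameters. The sketch below is only an illustration of that routine. The two treatments, their parameter values, and the error SD are invented, the errors are deliberately well behaved, and base `nls` is assumed. The F-test and AIC shown here both lean on the very assumptions in question.

```r
# Hypothetical two-treatment data set (all values invented for illustration)
set.seed(42)
x <- rep(seq(10, 1000, 30), times = 2)   # weed biomass
g <- rep(1:2, each = length(x) / 2)      # 1 = unfertilized, 2 = fertilized
I.true <- c(1, 0.5)[g]                   # treatment-specific slope parameter
y <- I.true * x / (1 + (I.true / 100) * x) + rnorm(length(x), sd = 5)
dat <- data.frame(x = x, y = y, g = g)

# Reduced model: one I and one A shared by both treatments
pooled <- nls(y ~ I * x / (1 + (I / A) * x), data = dat,
              start = list(I = 1, A = 100))

# Full model: I and A indexed by treatment
separate <- nls(y ~ I[g] * x / (1 + (I[g] / A[g]) * x), data = dat,
                start = list(I = c(1, 1), A = c(100, 100)))

cmp <- anova(pooled, separate)    # extra-sum-of-squares F-test
print(cmp)
print(AIC(pooled, separate))      # likelihood-based, assumes Gaussian errors
```

A small p-value here is read as "the treatment changes the relationship", which is only as trustworthy as the error assumptions behind the F distribution.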

Can we proceed with these kinds of comparisons despite the data inherently and always violating two of the major assumptions of nonlinear regression? If not, why not and what might be an alternative approach to answering the aforementioned kind of research questions?

Thanks so much,

Angela

Just for reference, here is the code used to generate the figure.

### Simulated data set (idealized) ###
set.seed(123)    
density<-as.numeric(seq(0,1000,10)) # hypothetical range of predictor variables
error <- rnorm(n = length(density), mean = 0, sd = 20) # normally distributed errors
A = 100 # asymptote parameter = 100%
I = 1 # slope parameter = 1
YL = I*density/(1+(I/A)*density) + error # yield loss + random error
plot(density, YL) # idealized but unrealistic data for this kind of study

### Modified data set (more realistic) ###
# That is ...
# (1) a maximum yield loss of 100% and 
# (2) decreased variability at higher density values
temp.logical <- density >= 500 & YL <= 95
new.error <- rnorm(n = length(temp.logical[temp.logical==TRUE]), mean = 0, sd = 5)
realistic.YL<-ifelse(density >= 500 & YL <= 100, 100 - abs(new.error),
                     ifelse(YL >= 100, 100, YL))
plot(density, realistic.YL, 
     xlab = expression(Weed ~ biomass ~ (g ~ m^{-2})),
     ylab = "Crop yield loss (%)") # more realistic dataset

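### Fitting a rectangular hyperbola to the modified data set ###
library(nls2) # provides nls2()
mod.1 <- nls2(realistic.YL ~ I*density/(1+(I/A)*density),
         start = list(I = 1, A = 100),
         trace = TRUE)
summary(mod.1)
I <- coef(mod.1)[["I"]] # fitted slope parameter
A <- coef(mod.1)[["A"]] # fitted asymptote parameter
pred.YL = I*density/(1+(I/A)*density) # fitted curve at the observed densities
lines(pred.YL ~ density) # overlay the fitted curve on the scatter plot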
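
The two issues described above can also be checked numerically rather than by eye. The following is a minimal sketch, assuming base `nls` in place of `nls2` and an arbitrary split at a biomass of 500; it regenerates the same simulated data so that it runs on its own.

```r
# Regenerate the "realistic" data with the same seed and draw order as above
set.seed(123)
density <- as.numeric(seq(0, 1000, 10))
error <- rnorm(n = length(density), mean = 0, sd = 20)
YL <- 1 * density / (1 + (1 / 100) * density) + error
new.error <- rnorm(n = sum(density >= 500 & YL <= 95), mean = 0, sd = 5)
realistic.YL <- ifelse(density >= 500 & YL <= 100, 100 - abs(new.error),
                       ifelse(YL >= 100, 100, YL))

# Fit the rectangular hyperbola with base nls()
fit <- nls(realistic.YL ~ I * density / (1 + (I / A) * density),
           start = list(I = 1, A = 100))
res <- as.numeric(residuals(fit))

# Heteroscedasticity: residual spread below vs. above the (arbitrary) split
low <- density < 500
sd.low <- sd(res[low])
sd.high <- sd(res[!low])
print(c(sd.low = sd.low, sd.high = sd.high))

# Non-normality: Shapiro-Wilk test on the residuals as a rough check
print(shapiro.test(res))
```

A much larger residual SD in the lower half than in the upper half is the numerical counterpart of the funnel shape visible in the plot.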

Best Answer

The normality assumption is not necessary for nonlinear regression. It is often used because it's convenient. However, if it's clearly violated then I wouldn't use such an assumption at all. The same goes for homoscedasticity.

In your example the dependent variable seems to be confined between 0 and 100%. You could still assume normality and homoscedasticity if the data were "far" from the bounds. However, your sample spans the whole range, with a substantial portion clustered at the bounds. In this case neither homoscedasticity nor normality seems like a reasonable assumption.
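
One way to quantify uncertainty without assuming either property is a case-resampling bootstrap of the fitted parameters. This is a sketch of one possible approach, not the only option. The simulated data, the helper `fit_hyperbola`, and the number of resamples are all assumptions made for illustration.

```r
# Hypothetical bounded, heteroscedastic data (invented for illustration)
set.seed(1)
x <- seq(0, 1000, 10)
mu <- x / (1 + x / 100)                       # hyperbola with I = 1, A = 100
noise.sd <- 20 * (1 - mu / 100) + 1           # spread shrinks near the 100% limit
y <- pmin(100, pmax(0, mu + rnorm(length(x), sd = noise.sd)))
dat <- data.frame(x = x, y = y)

# Hypothetical helper: least-squares fit of the rectangular hyperbola
fit_hyperbola <- function(d) {
  nls(y ~ I * x / (1 + (I / A) * x), data = d, start = list(I = 1, A = 100))
}

# Resample (x, y) pairs with replacement and refit; failed fits become NA
B <- 200
boot_est <- t(replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)
  tryCatch(coef(fit_hyperbola(dat[idx, ])),
           error = function(e) c(I = NA_real_, A = NA_real_))
}))
boot_est <- boot_est[complete.cases(boot_est), , drop = FALSE]

# Percentile intervals for I and A
ci <- apply(boot_est, 2, quantile, probs = c(0.025, 0.975))
print(ci)
```

To compare two data sets, the same resampling could be run within each treatment group and the bootstrap distributions of the parameter differences inspected, again without invoking normal, constant-variance errors.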
