Solved – Compare non-linear model parameter estimates between conditions

nonlinear regression, r, statistical significance

What is the appropriate way to test for a significant difference between the same parameter estimated by 2 nonlinear models? An example using R – here are 2 datasets:

library(tidyverse)
library(broom)  # tidy() below comes from broom, which library(tidyverse) does not attach

# example from ?nls
DNase1 <- subset(DNase, Run == 1)
DNase2 <- subset(DNase, Run == 2)

Both datasets can be fit with a nonlinear function using the nls() function and coefficients extracted:

## fit models and extract coefficients
m1 <- nls(density ~ SSlogis(log(conc), Asym, xmid, scal), DNase1)
m1_coef <- tidy(m1) %>% 
  mutate(Run = 1)

m2 <- nls(density ~ SSlogis(log(conc), Asym, xmid, scal), DNase2)
m2_coef <- tidy(m2) %>% 
  mutate(Run = 2)

pars <- rbind(m1_coef, m2_coef) %>% 
  dplyr::filter(term == "Asym")

print(pars)

For simplicity, the results are filtered to the 2 estimates of the 'Asym' parameter, one for each condition (Run 1 & 2), i.e. one from each of the 2 models:

     term Estimate Std. Error  t value     Pr(>|t|) Run
1    Asym 2.345182  0.0781541 30.00715 2.165539e-13   1
2    Asym 2.595948  0.0646589 40.14835 5.109901e-15   2

Is there a way to test whether the estimate for 'Asym' from Run 2 (2.596) is significantly different from the estimate from Run 1 (2.345)?

Best Answer

Create a model m12 with a separate set of parameters for each run and a model m0 in which the parameters are the same for both runs, and then compare those two models using an F test. In R that would be done like this:

# m1 and m2 will be used to set starting values for m12 and a0
fo <- density ~ SSlogis(log(conc), Asym, xmid, scal)
m1 <- nls(fo, DNase, subset = Run == 1)
m2 <- nls(fo, DNase, subset = Run == 2)

Logis <- function(x, Asym, xmid, scal) Asym / (1 + exp((xmid - x)/ scal))

# Run 1 and Run 2 each have a separate set of parameters.
# (Run is a factor whose labels don't correspond to its levels so make it numeric.)
m12 <- nls(density ~ Logis(log(conc), Asym[Run], xmid[Run], scal[Run]),
  transform(DNase, Run = as.numeric(as.character(Run))), 
  subset = Run %in% 1:2, 
  start = as.data.frame(rbind(coef(m1), coef(m2))))

# Run 1 and Run 2 have the same set of parameters
m0 <- nls(fo, DNase, subset = Run %in% 1:2)

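# F test: does allowing a separate set of parameters for each run
# significantly reduce the residual sum of squares relative to m0?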
anova(m0, m12)

giving:

Analysis of Variance Table

Model 1: density ~ SSlogis(log(conc), Asym, xmid, scal)
Model 2: density ~ Logis(log(conc), Asym[Run], xmid[Run], scal[Run])
  Res.Df Res.Sum Sq Df   Sum Sq F value    Pr(>F)    
1     29   0.096915                                  
2     26   0.008478  3 0.088437  90.408 7.044e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The test above says that at least one of the three parameters differs between the runs. If we want to test only whether Asym differs, without constraining xmid and scal to be the same, then create a model a0 in which Asym is shared between the runs but the other parameters remain run-specific, and compare it to m12.

# same Asym for Run 1 and Run 2 but other parameters separate
a0 <- nls(density ~ Logis(log(conc), Asym, xmid[Run], scal[Run]),
  transform(DNase, Run = as.numeric(as.character(Run))), 
  subset = Run %in% 1:2, 
  start = c(Asym = (coef(m1)[[1]] + coef(m2)[[1]])/2, 
    as.data.frame(rbind(coef(m1)[-1], coef(m2)[-1]))))

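# 1 df F test of whether Asym differs between the runs, given run-specific xmid and scal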
anova(a0, m12)

giving:

Analysis of Variance Table

Model 1: density ~ Logis(log(conc), Asym, xmid[Run], scal[Run])
Model 2: density ~ Logis(log(conc), Asym[Run], xmid[Run], scal[Run])
  Res.Df Res.Sum Sq Df    Sum Sq F value  Pr(>F)  
1     27  0.0103246                               
2     26  0.0084778  1 0.0018468  5.6639 0.02494 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
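A closely related follow-up (not part of the original answer): the same 1 df comparison can be expressed by reparameterizing the full model with an explicit difference term, here called dAsym, a name introduced only for this sketch. Its t statistic addresses the same hypothesis as anova(a0, m12), and a confidence interval for it describes the size of the Run 2 minus Run 1 difference in Asym. A minimal sketch, reusing Logis, m1 and m2 from above:

# DNase runs 1 and 2 with Run coerced to numeric, as in m12 above
DNase12 <- transform(subset(DNase, Run %in% 1:2),
  Run = as.numeric(as.character(Run)))

# Asym for Run 1 is Asym; Asym for Run 2 is Asym + dAsym
d12 <- nls(density ~ Logis(log(conc), Asym + dAsym * (Run - 1), xmid[Run], scal[Run]),
  DNase12,
  start = c(Asym = coef(m1)[[1]],
    dAsym = coef(m2)[[1]] - coef(m1)[[1]],
    as.data.frame(rbind(coef(m1)[-1], coef(m2)[-1]))))

summary(d12)          # t test of dAsym = 0: same 1 df question as anova(a0, m12)
confint(d12, "dAsym") # profile confidence interval for the Run 2 minus Run 1 difference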