Solved – Fraction of variance unexplained and R-squared in linear and non-linear regression

linear model, nonlinear regression, r, regression

I have a non-linear model of the following form:

$y = a x^b$

I can fit it using logarithms and a linear model or directly with a non-linear model.

First approach, logarithms and linear model:

lmfit <- lm(log(y)~log(x))

Second approach, non-linear model:

nlsfit <- nls(y~a*x^b, start=list(a=200, b=1.6))

In the first case I can simply get the $R^2$ value from the linear model or calculate it myself by:

rsquared <- var(fitted(lmfit)) / var(log(y))

In the second case no $R^2$ value is reported, but I can obtain a pseudo-$R^2$ value myself by:

pseudorsquared <- var(fitted(nlsfit)) / var(y)

In a linear model I can calculate the fraction of variance unexplained by simply doing $1-R^2$. I have read that this is not applicable to non-linear regressions. I would like to know if there is an equivalent version of this measure, so that I can compare both regressions and use the best one.
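One way to make the two fits comparable is to compute the fraction of variance unexplained (FVU) on the original scale of $y$ for both, after back-transforming the linear model's predictions. A minimal sketch in R; the simulated data (true values a = 200, b = 1.6 and the noise level) are assumptions standing in for the real measurements:

```r
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- 200 * x^1.6 * exp(rnorm(50, sd = 0.1))  # multiplicative noise

lmfit  <- lm(log(y) ~ log(x))
nlsfit <- nls(y ~ a * x^b, start = list(a = 200, b = 1.6))

# Back-transform the linear fit so both predictions are on the y scale
pred_lm  <- exp(fitted(lmfit))
pred_nls <- fitted(nlsfit)

# FVU = residual sum of squares / total sum of squares
fvu <- function(obs, pred) sum((obs - pred)^2) / sum((obs - mean(obs))^2)
fvu(y, pred_lm)
fvu(y, pred_nls)
```

This puts both models on the same footing, though it quietly favours whichever error structure the data actually have, which is the answer's central point.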

As extra information, I would like to add that this is a regression of physical variables, and that the non-linear approach gives coefficients closer to literature values, whereas the linear approach gives better statistical performance ($R^2$, bias, etc.).

Best Answer

What needs greater exposure here is that the different methods make quite different assumptions about error structure. From one (perhaps conservative) viewpoint, the best method is the one that makes the most accurate assumptions about functional form and error structure, and it follows that any other method is just not as good.

You have not mentioned a third way which is quite easy to implement: use a generalized linear model for y as a function of log x, with a log link. Someone else will easily be able to give R code for that if it is not evident.
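For what it's worth, a minimal sketch of that third approach: a Gaussian GLM with a log link fits $E(y) = \exp(\beta_0 + \beta_1 \log x) = a x^b$ while keeping errors additive on the original scale of $y$. The simulated data here are an assumption for illustration:

```r
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- 200 * x^1.6 + rnorm(50, sd = 50)  # additive noise suits this model

# log link gives the power-law mean; errors stay additive in y
glmfit <- glm(y ~ log(x), family = gaussian(link = "log"))

# Map the coefficients back to the power law y = a * x^b
a_hat <- exp(coef(glmfit)[["(Intercept)"]])
b_hat <- coef(glmfit)[["log(x)"]]
```

This sits between the other two fits: the mean function matches `nls`, while estimation stays within the linear-model machinery.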

Or a fourth way (which in turn can be done in different ways): to assume that both y and x are subject to error.

In terms of $R^2$ measures, my own preference here is to regard them as variations on squaring corr(observed $y$, predicted $y$), the variations coming in how $y$ is predicted. There remains a difference between any fitting procedure that can be presented as (directly equivalent to) maximizing such an $R^2$ and any where it is a descriptive figure of merit calculated post hoc.
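That view can be made concrete: compute $R^2$ as the squared correlation between observed and predicted $y$, with each method supplying its own predictions on the original scale. A sketch under the same simulated-data assumption as before; the back-transformation of the linear fit is the only subtlety:

```r
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- 200 * x^1.6 * exp(rnorm(50, sd = 0.1))

lmfit  <- lm(log(y) ~ log(x))
nlsfit <- nls(y ~ a * x^b, start = list(a = 200, b = 1.6))

# Squared correlation of observed vs predicted y: one figure of merit
# that can be calculated post hoc for any fitting method
r2 <- function(obs, pred) cor(obs, pred)^2
r2(y, exp(fitted(lmfit)))
r2(y, fitted(nlsfit))
```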

But what do you really seek in such measures? It seems to me more fruitful to look at the structure of residuals using a graphical approach. On this criterion the best model leaves least structure in the residuals.
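A quick way to apply that criterion in R, again with simulated data standing in for the real measurements: plot residuals against fitted values, and look for trend or funnel shapes, which the adequate model should not leave behind.

```r
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- 200 * x^1.6 * exp(rnorm(50, sd = 0.1))
nlsfit <- nls(y ~ a * x^b, start = list(a = 200, b = 1.6))

# Residual-vs-fitted plot: the best model leaves least structure here
plot(fitted(nlsfit), residuals(nlsfit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```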

The power function appears as a favourite model in several literatures that don't overlap very much, although many of the same problems and many of the same solutions have been rediscovered in different fields. Less satisfactorily, systematic vagueness about how parameters were estimated is also common across disciplines.

Consistency with literature estimates might well depend on consistency with the dominant method in your field.
