Solved – How to correct for non-linearity of response in linear regression

heteroscedasticity, multiple regression, regression

I want to train a linear regression model to predict a non-linear variable. This is how the two independent variables relate to the response (points are jittered):

[Image: the two predictors plotted against the response, jittered (not reproduced)]

And the residuals against the fitted values:

[Image: residuals plotted against fitted values (not reproduced)]

Most of the values for the response are zero. The result is very strong heteroscedasticity

        studentized Breusch-Pagan test

data:  model
BP = 55483.84, df = 2, p-value < 2.2e-16

even though the predictors are strongly correlated with the response

Call:
lm(formula = response ~ predictor1 + predictor2, data = train_predictors)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6996 -0.0268 -0.0238 -0.0182  4.8785 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.748e-02  2.825e-04   97.28   <2e-16 ***
predictor1   8.491e-05  6.574e-07  129.16   <2e-16 ***
predictor2  -3.934e-10  8.298e-12  -47.41   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1561 on 498498 degrees of freedom
Multiple R-squared:  0.0365,    Adjusted R-squared:  0.0365 
F-statistic:  9442 on 2 and 498498 DF,  p-value: < 2.2e-16

Should I consider adopting non-linear models instead, or could I first try correcting for the non-linearity of the response?

Best Answer

I don't know the details of your model, but in my opinion you need to deal with the large number of zero responses. Look into compound models with a point mass at zero, such as the Tweedie model.
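As a rough illustration of modelling a point mass at zero, here is a minimal sketch of a two-part model in base R. This is not the Tweedie model itself but a simpler stand-in: one model for whether the response is non-zero, another for its size when it is. The data are simulated, all variable names are hypothetical, and the sketch assumes the response is non-negative, which may not hold for the data in the question.

```r
# Simulated, hypothetical data: most responses are exactly zero,
# the rest are positive and right-skewed.
set.seed(1)  # arbitrary seed, for reproducibility only
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
is_pos <- rbinom(n, 1, plogis(-2 + 0.8 * x1))
y <- is_pos * rgamma(n, shape = 2, rate = 2 / exp(0.5 + 0.3 * x2))
dat <- data.frame(y = y, x1 = x1, x2 = x2)
dat$nonzero <- as.integer(dat$y > 0)

# Part 1: probability that the response is non-zero (logistic regression).
fit_zero <- glm(nonzero ~ x1 + x2, family = binomial, data = dat)

# Part 2: size of the response given that it is non-zero
# (Gamma GLM with log link, fitted to the positive responses only).
fit_pos <- glm(y ~ x1 + x2, family = Gamma(link = "log"),
               data = dat, subset = y > 0)

# Expected response = P(y > 0) * E[y | y > 0]
p_hat  <- predict(fit_zero, newdata = dat, type = "response")
mu_hat <- predict(fit_pos,  newdata = dat, type = "response")
dat$pred <- p_hat * mu_hat
```

The two parts can use different predictors, which is useful when what drives a zero differs from what drives the magnitude.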

Related Solutions

Solved – Why is the Breusch-Pagan test significant on simulated data designed not to be heteroscedastic

No, the data are not heteroscedastic (by way of how you simulated them). Did you notice the 0 degrees of freedom of the test? That is a hint that something is going wrong here. The B-P test takes the squared residuals from the model and tests whether the predictors in the model (or any other predictors you specify) can account for substantial amounts of variability in these values. Since you only have the intercept in the model, it cannot account for any variability by definition.

Take a look at: http://en.wikipedia.org/wiki/Breusch-Pagan_test

Also, make sure you read help(bptest). That should help to clarify things.

One thing that is going wrong here is that the bptest() function apparently does not test for this errant case and happens to throw out a tiny p-value. In fact, if you look carefully at the code underlying the bptest() function, essentially this is happening:

format.pval(pchisq(0,0), digits=4)

which gives "< 2.2e-16". So, pchisq(0,0) returns 0 and that is turned into "< 2.2e-16" by format.pval(). In a way, that is all correct, but it would probably help to test for zero dfs in bptest() to avoid this sort of confusion.
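A guard of that kind could look roughly like the following sketch. The helper name `bp_df` is hypothetical and not part of lmtest; it simply counts the design-matrix columns other than the intercept, and it assumes the model was fitted with an intercept.

```r
# Hypothetical helper (not part of lmtest): number of predictors the B-P test
# would use, i.e. columns of the design matrix minus the intercept.
bp_df <- function(model) {
  ncol(model.matrix(model)) - 1L  # assumes the model has an intercept
}

y  <- rnorm(50)
x1 <- rnorm(50)
x2 <- rnorm(50)
mod0 <- lm(y ~ 1)        # intercept only
mod2 <- lm(y ~ x1 + x2)  # two predictors

for (m in list(mod0, mod2)) {
  if (bp_df(m) == 0) {
    message("No predictors in the model: the B-P test has 0 df and is not meaningful.")
  } else {
    message("B-P test would use ", bp_df(m), " degrees of freedom.")
  }
}
```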

EDIT

There is still lots of confusion concerning this question. Maybe it helps to really show what the B-P test actually does. Here is an example. First, let's simulate some data that are homoscedastic. Then we fit a regression model with two predictors. And then we carry out the B-P test with the bptest() function.

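library(lmtest)               # provides bptest()
n <- 100
x1i <- rnorm(n)               # two predictors
x2i <- rnorm(n)
yi  <- rnorm(n)               # response: constant variance, unrelated to the predictors
mod <- lm(yi ~ x1i + x2i)
bptest(mod)                   # studentized Breusch-Pagan test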

So, what is really happening? First, take the squared residuals based on the regression model. Then take $n \times R^2$ when regressing these squared residuals on the predictors that were included in the original model (note that the bptest() function uses the same predictors as in the original model, but one can also use other predictors here if one suspects that the heteroscedasticity is a function of other variables). That is the test statistic for the B-P test. Under the null hypothesis of homoscedasticity, this test statistic follows a chi-square distribution with degrees of freedom equal to the number of predictors used in the test (not counting the intercept). So, let's see if we can get the same results:

e2 <- resid(mod)^2                               # squared residuals
bp <- summary(lm(e2 ~ x1i + x2i))$r.squared * n  # test statistic: n * R^2
bp
pchisq(bp, df=2, lower.tail=FALSE)               # p-value, df = number of predictors

Yep, that works. By chance, the test above may turn out to be significant (which is a Type I error since the data simulated are homoscedastic), but in most cases it will be non-significant.
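To see how often such a Type I error occurs, the simulation can be repeated many times. The following is a minimal sketch using only base R and the same $n \times R^2$ calculation; the function name `bp_pvalue`, the seed, and the number of replications are arbitrary choices for illustration.

```r
# Manual Breusch-Pagan p-value (n * R^2 form) for one simulated,
# homoscedastic data set with two predictors.
bp_pvalue <- function(n = 100) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y  <- rnorm(n)                      # homoscedastic by construction
  e2 <- resid(lm(y ~ x1 + x2))^2      # squared residuals
  bp <- summary(lm(e2 ~ x1 + x2))$r.squared * n
  pchisq(bp, df = 2, lower.tail = FALSE)
}

set.seed(123)                         # arbitrary seed
pvals <- replicate(1000, bp_pvalue())
mean(pvals < 0.05)                    # share of false rejections at the 5% level
```

If the test behaves as intended, the share of rejections should be close to the nominal 5% level.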

Solved – Comparing two linear regression models

If you set up the data in one long column, with A and B identified in a new column, you can then run your regression model as a GLM with a continuous time variable and a nominal "experiment" variable (A, B). The output of the ANOVA will give you the significance of the difference between the parameters. The "intercept" is the common intercept, and the "experiment" factor will reflect differences between the intercepts (actually overall means) of the experiments. The "time" factor will be the common slope, and the interaction is the difference between the experiments with respect to the slope.

I have to admit I cheat (?) and run the models separately first to get the two sets of parameters and their errors and then run the combined model to acquire the differences between the treatments (in your case A and B)...
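The combined model can be sketched in R as follows. The data are simulated and the names `long`, `time`, and `experiment` are hypothetical. Note one assumption about coding: with R's default treatment contrasts, the intercept and `time` coefficients refer to experiment A rather than to a common value, while `experimentB` and `time:experimentB` are the differences between B and A.

```r
# Simulated, hypothetical data in long format: one row per observation,
# with a factor marking which experiment it came from.
set.seed(42)  # arbitrary seed
time <- rep(1:20, times = 2)
experiment <- factor(rep(c("A", "B"), each = 20))
y <- ifelse(experiment == "A", 1 + 0.5 * time, 3 + 0.8 * time) +
  rnorm(40, sd = 0.5)
long <- data.frame(y = y, time = time, experiment = experiment)

# Combined model: separate intercept and slope for each experiment.
fit <- lm(y ~ time * experiment, data = long)

coefs <- summary(fit)$coefficients
coefs        # experimentB: intercept difference; time:experimentB: slope difference
anova(fit)   # sequential tests for time, experiment, and their interaction

# Separate fits, to read off each experiment's own parameters and standard errors.
fit_A <- lm(y ~ time, data = long, subset = experiment == "A")
fit_B <- lm(y ~ time, data = long, subset = experiment == "B")
```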
