Solved – Tobit for corner solution models: normality and homoskedasticity tests

hypothesis-testing, r, tobit-regression

I would like to regress a dependent variable called Y on several explanatory variables (called X). Y is non-negative and continuous over its positive range, with a mass of observations at 0, so I treat it as left-censored at 0. Y represents how much a household is willing to pay to protect the environment in a specific area. The data are cross-sectional.

I fitted a type I Tobit model in R using the AER and censReg packages:

library(AER)      # provides tobit()
library(censReg)  # provides censReg()

fit_tobit <- tobit(Y ~ X, left = 0, data = mydata)

fit_censreg <- censReg(Y ~ X, left = 0, data = mydata)

and then

plot(fitted(fit_tobit), residuals(fit_tobit))

Looking at the residual plot, I see that the residuals are not normally distributed, whereas the Tobit model requires the errors to be normally distributed and homoskedastic.

Does anyone know how I can test these errors for normality and heteroskedasticity in R, given the censoring? Using bptest() from the lmtest package (the Breusch-Pagan test) does not work. I know that in some programs, like Stata, it's pretty straightforward to test residuals, but I have no idea how to do it in R.

Thank you!

Mareen

Best Answer

The residuals from such tobit models will often look non-normal (typically right-skewed) due to the censoring. This makes it difficult to apply the standard techniques for linear regression models, such as graphical checks or diagnostic tests (like the Breusch-Pagan test).
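The skewness claim can be checked with a small simulation. The following is only a sketch under stated assumptions: the data are simulated, the coefficient values are made up, and the residuals are computed from the true latent mean rather than from a fitted tobit model (to keep the example in base R). Even though the latent errors are exactly normal and homoskedastic, the observable residuals come out right-skewed.

```r
# Simulate a correctly specified type I tobit: latent errors are exactly normal.
set.seed(1)
n  <- 5000
x  <- rnorm(n)
mu <- 0.5 + 0.5 * x          # true latent mean (made-up values, for illustration)
e  <- rnorm(n)               # normal, homoskedastic latent errors
y_star <- mu + e             # latent outcome
y  <- pmax(y_star, 0)        # observed outcome, left-censored at 0

# Response-type residuals, using the true mean instead of an estimate
res_obs    <- y - mu         # equals pmax(e, -mu): the left tail is cut off
res_latent <- y_star - mu    # equals e; not observable in practice

# Sample skewness (third standardized moment)
skewness <- function(z) mean((z - mean(z))^3) / mean((z - mean(z))^2)^1.5

skew_obs    <- skewness(res_obs)
skew_latent <- skewness(res_latent)
c(observed = skew_obs, latent = skew_latent)
```

The observed residuals show clearly positive skewness while the latent ones are close to symmetric, so a skewed residual plot alone does not indicate that the normality assumption is violated.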

One thing you can do, though, is to fit a heteroskedastic tobit model in which the variance depends on some covariates. And instead of using a latent normal distribution you could also employ a distribution with heavier tails such as the logistic or t distribution. All these approaches are available in the R package crch (for censored regression with conditional heteroskedasticity). So you can fit models, say with constant variance and with heteroskedasticity, and then compare these models by means of information criteria (AIC, BIC, ...) or likelihood ratio tests etc.
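A minimal sketch of this comparison, assuming the crch package is installed and using simulated data (the variable names, coefficient values, and the choice of x as the variance covariate are made up for illustration). In a crch formula, the part after `|` specifies the regressors for the scale, which is modelled on the log scale by default. The likelihood ratio test is computed by hand from logLik() here to keep the example short.

```r
library(crch)  # assumed to be installed; provides crch()

set.seed(42)
n <- 1000
x <- runif(n)
# Latent outcome whose error standard deviation grows with x (heteroskedastic)
y_star <- 0.5 + 1.5 * x + exp(-0.5 + 1.0 * x) * rnorm(n)
d <- data.frame(y = pmax(y_star, 0), x = x)   # left-censored at 0

# Homoskedastic tobit: constant scale
m_const <- crch(y ~ x, data = d, left = 0, dist = "gaussian")
# Heteroskedastic tobit: log(scale) depends on x (the part after "|")
m_het <- crch(y ~ x | x, data = d, left = 0, dist = "gaussian")
# Heavier-tailed alternative: latent logistic distribution
m_logis <- crch(y ~ x | x, data = d, left = 0, dist = "logistic")

# Information criteria: smaller is better
AIC(m_const, m_het, m_logis)

# Likelihood ratio test of constant variance (m_const is nested in m_het)
lr_stat <- as.numeric(2 * (logLik(m_het) - logLik(m_const)))
lr_df   <- attr(logLik(m_het), "df") - attr(logLik(m_const), "df")
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p.value = lr_p)
```

A small p-value favours the heteroskedastic specification. The Gaussian and logistic models are not nested, so they are compared through the information criteria rather than through a likelihood ratio test.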

Of course, there are further sources of misspecification problems, e.g., omitted regressors, misspecified functional form etc.

Related Solutions

Breusch-Pagan Test – Differences Between Two Types Explained

Your guess is correct: ncvTest performs the original version of the Breusch-Pagan test. This can be verified by comparing it to bptest(model, studentize = FALSE). (As @Helix123 pointed out, the two functions also differ in other aspects such as default arguments; check the package manuals of lmtest and car for more detail.)

The studentized Breusch-Pagan test was proposed by R. Koenker in his 1981 article "A Note on Studentizing a Test for Heteroscedasticity". The most obvious difference between the two is that they use different test statistics. Namely, let $\xi^\ast$ be the studentized test statistic and $\hat{\xi}$ be the original one. Then $$\hat{\xi}=\lambda\xi^\ast,\qquad\lambda=\frac{\operatorname{Var}(\varepsilon^2)}{2\operatorname{Var}(\varepsilon)^2}.$$

Here is a snippet of code that demonstrates what I just wrote (data taken from the faraway package):

> library(faraway)  # stat500 data
> library(lmtest)   # bptest()
> library(car)      # ncvTest()
> mdl = lm(final ~ midterm, data = stat500)
> bptest(mdl)

    studentized Breusch-Pagan test

data:  mdl
BP = 0.86813, df = 1, p-value = 0.3515

> bptest(mdl, studentize = FALSE)

    Breusch-Pagan test

data:  mdl
BP = 0.67017, df = 1, p-value = 0.413

> ncvTest(mdl)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 0.6701721    Df = 1     p = 0.4129916 
> 
> n = nrow(stat500)
> e = residuals(mdl)
> bpmdl = lm(e^2 ~ midterm, data = stat500)
> lambda = (n - 1) / n * var(e^2) / (2 * ((n - 1) / n * var(e))^2)
> Studentized_bp = n * summary(bpmdl)$r.squared
> Original_bp = Studentized_bp * lambda
> 
> Studentized_bp
[1] 0.8681335
> Original_bp
[1] 0.6701721

As for why one wants to studentize the original BP test, a direct quote from R. Koenker's article may be helpful:

... Two conclusions emerge from this analysis:

The asymptotic power of the Breusch and Pagan test is extremely sensitive to the kurtosis of the distribution of $\varepsilon$, and

the asymptotic size of the test is correct only in special case of Gaussian kurtosis.

The former conclusion is expanded upon in Koenker and Bassett (1981) where alternative, robust tests for heteroscedasticity are suggested. The latter conclusion implies that the significance levels suggested by Breusch and Pagan will be correct only under Gaussian conditions on $\varepsilon$. Since such conditions are generally assumed on blind faith and are notoriously difficult to verify, a modification of the Breusch and Pagan test is suggested which correctly "studentise" the test statistic and leads to asymptotically correct significance levels for a reasonably large class of distributions for $\varepsilon$.

In short, the studentized BP test is more robust than the original one.
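The role of kurtosis can be read off the scaling factor $\lambda$ defined earlier. As a short worked step, assuming a zero-mean error with variance $\sigma^2$ and writing $\kappa = E(\varepsilon^4)/\sigma^4$ for its kurtosis (notation introduced here, not in Koenker's quote):

$$\lambda=\frac{\operatorname{Var}(\varepsilon^2)}{2\operatorname{Var}(\varepsilon)^2}=\frac{E(\varepsilon^4)-\sigma^4}{2\sigma^4}=\frac{\kappa-1}{2}.$$

For Gaussian errors $\kappa = 3$, so $\lambda = 1$ and the two statistics coincide asymptotically. With heavier tails ($\kappa > 3$) the original statistic is inflated relative to the studentized one, and with lighter tails it is deflated. In the stat500 example above, $\lambda = 0.6701721 / 0.8681335 \approx 0.772$, which corresponds to a sample kurtosis of about $2.54$.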

Solved – Exact difference between two-part models (e.g., Cragg) and Tobit type 2 models (e.g., Heckman)

Thanks for asking, Mark. In the context of my data I ended up using the double hurdle model proposed by Blundell (the first bullet of my suggested solutions). Based on the feedback I received at academic conferences this seems to be a viable approach. I also ended up using the R package mhurdle. Weights simply do not work, but the rest of the code seems to be very solid.

Regarding my specific questions: I do not have a definitive answer to all of them, but let me try to summarise what I learnt:

Is my statement about the three models correct? It appears so, yes.

Are the sources of zeros the only/main decision criteria? They are certainly not the only decision criteria, but in the context of data with a mass point at zero, spending significant time on understanding how the zeros are generated is tremendously important.

What are the key decision criteria I should consider/discuss when deciding about what type of model to use? Besides the obvious questions regarding the type of dependent variable and its distribution, the main question for data with a mass point at zero is whether you want to distinguish your results by the two different stages or whether it is sufficient to report one set of coefficients. If one set is sufficient, you may use a Tobit model; otherwise you need a two-part model, where the discussion about the different sources of zeros comes into play.

Is there more than 'just' the source of the zeros? Yes, there is. There are at least two kinds of zeros: observed/true zeros and unobserved/false zeros (the latter actually being either NAs or values so small that they are recoded as 0).
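To make the "two stages" distinction concrete, here is a rough base R sketch of a simple two-part model. It is not the mhurdle double hurdle model mentioned above; the simulated data, the covariate z, the coefficient values, and the log-normal amount equation are all assumptions made for illustration.

```r
set.seed(123)
n <- 2000
z <- rnorm(n)   # hypothetical covariate

# Stage 1: participation (is Y positive?), generated from a probit
participate <- rbinom(n, 1, pnorm(0.3 + 0.8 * z))
# Stage 2: amount given participation, generated as log-normal
amount <- exp(1 + 0.5 * z + rnorm(n, sd = 0.7))
d <- data.frame(y = participate * amount, z = z)   # mass point at zero

# Part 1: probit for P(Y > 0), using all observations
part1 <- glm(I(y > 0) ~ z, family = binomial(link = "probit"), data = d)
# Part 2: linear model for log(Y), using the positive observations only
part2 <- lm(log(y) ~ z, data = d[d$y > 0, ])

coef(part1)   # effect of z on whether anything is spent
coef(part2)   # effect of z on how much is spent, given spending
```

The two parts yield two separate sets of coefficients, so a covariate may affect participation and amount differently, whereas a Tobit model forces a single set of coefficients to describe both.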

Hope this helps you a bit! Jan