R Regression – A Comprehensive Guide on Censored Regression in R

I have a Tobit model of the form:
\begin{align}
Y^*_i &= X_i\beta + \epsilon_i \\
Y_i &= \max(Y^*_i,0).
\end{align}

The regressors are one continuous variable and 30 factors (modelling seasonal effects during the day). Of the $3116$ observations in my sample, $194$ are left-censored and $2922$ are uncensored.

I tried the R packages AER and censReg. With tobit() in AER I get an error message:

Error in solve.default(L %*% V %*% t(L)) : 
Lapack routine dgesv: system is exactly singular: U[2,2] = 0

This leads to a first question: do I have too many factors? AER also has a predict function. Given the non-negative data $Y_i$, I would like to predict the censored response, $E[Y|Y>0,X_i]$. Is this what the predict function in AER gives me?

How can I predict in the package censReg?

EDIT 1) As Achim Zeileis pointed out, there is a data problem with my factors: they are 0 or NaN too often, so the tobit algorithm cannot find a solution. I have to fix the data.
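
A quick way to look for this kind of data problem is to count censored and uncensored observations per factor level and to check the rank of the design matrix. The following is only a sketch on made-up data (the data frame `d` and its columns are hypothetical, not the data from this question). A level that is never observed uncensored is one plausible cause of numerical trouble, even when the design matrix itself has full rank.

```r
# Hypothetical data: four factor levels, level "s4" is always censored
d <- data.frame(
  x = rep(seq(-2, 2, length.out = 10), times = 4),
  f = factor(rep(c("s1", "s2", "s3", "s4"), each = 10))
)
d$y <- pmax(1 + 2 * d$x, 0)   # left-censoring at zero
d$y[d$f == "s4"] <- 0         # force one level to be fully censored

# Censored and uncensored counts per factor level
status <- factor(d$y > 0, levels = c(FALSE, TRUE),
                 labels = c("censored", "uncensored"))
counts <- table(level = d$f, status = status)
print(counts)

# Levels without a single uncensored observation
problem_levels <- rownames(counts)[counts[, "uncensored"] == 0]

# Missing values per column and rank of the design matrix
na_counts <- colSums(is.na(d))
X <- model.matrix(~ 0 + x + f, data = d)
full_rank <- qr(X)$rank == ncol(X)
```

Here `problem_levels` flags the fully censored level although `full_rank` is `TRUE`, so both checks are worth running.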

EDIT 2) Please look at the following example:

library(AER)
N = 10
f = rep(c("s1","s2","s3","s4","s5","s6","s7","s8"),N)
fcoeff = rep(c(-1,-2,-3,-4,-3,-5,-10,-5),N)
set.seed(100) 
x = rnorm(8*N)+1
beta = 5
epsilon = rnorm(8*N,sd = sqrt(1/5))
y.star = x*beta+fcoeff+epsilon ## latent response
y = y.star 
y[y<0] <- 0 ## censored response

fit <- tobit(y~0+x+f)
summary(fit)

coef(fit) # very satisfying estimates

my.range = range(y, y.star, predict(fit))

plot(y, ylim=my.range)
lines(predict(fit), col="red")
lines(y.star, col="blue")

As Achim Zeileis writes, predict() predicts the latent $Y^*$, but how can we predict $Y$ efficiently and without bias?

Best Answer

The predict() and fitted() methods for tobit objects compute estimates of the latent mean $\mu = E[y^*] = x^\top \beta$. Additionally, the scale parameter $\sigma$ is available as the $scale element of the fitted object:

mu <- fitted(fit)
sigma <- fit$scale

The probability of a non-zero observation is then $P(y > 0) = \Phi(\mu/\sigma)$, i.e.:

p0 <- pnorm(mu/sigma)

The conditional expectation of the censored $y$ given that it is non-zero is $E[y | y > 0] = \mu + \sigma \cdot \lambda(\mu/\sigma)$, where $\lambda(\cdot)$ is the inverse Mills ratio $\lambda(x) = \phi(x)/\Phi(x)$:

lambda <- function(x) dnorm(x)/pnorm(x)
ey0 <- mu + sigma * lambda(mu/sigma)
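
For heavily censored observations (very negative $\mu/\sigma$), both dnorm() and pnorm() underflow to zero and the ratio above becomes NaN. A minimal sketch of a more robust variant that works on the log scale, assuming this matters for your data (the names `lambda_stable` and `lambda_naive` are only illustrative):

```r
# Inverse Mills ratio on the log scale to avoid 0/0 underflow
lambda_stable <- function(x) {
  exp(dnorm(x, log = TRUE) - pnorm(x, log.p = TRUE))
}

# Direct ratio, as in the definition above
lambda_naive <- function(x) dnorm(x) / pnorm(x)

# Compare both versions for a few standardized values
z <- c(-40, -5, 0, 2)
cbind(z, naive = lambda_naive(z), stable = lambda_stable(z))
```

For moderate arguments the two versions agree, so the log-scale form can be used as a drop-in replacement.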

Finally, the unconditional expectation is $E[y] = P(y > 0) \cdot E[y | y > 0]$, i.e.:

ey <- p0 * ey0
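
These expressions follow from the truncated normal distribution. As a short derivation sketch, write $y^* = \mu + \sigma z$ with $z \sim N(0, 1)$, so that $y > 0$ is equivalent to $z > -\mu/\sigma$:

\begin{align}
E[z \mid z > -\mu/\sigma] &= \frac{\phi(-\mu/\sigma)}{1 - \Phi(-\mu/\sigma)} = \frac{\phi(\mu/\sigma)}{\Phi(\mu/\sigma)} = \lambda(\mu/\sigma), \\
E[y \mid y > 0] &= \mu + \sigma \cdot \lambda(\mu/\sigma), \\
E[y] &= \Phi(\mu/\sigma) \cdot \mu + \sigma \cdot \phi(\mu/\sigma).
\end{align}

The last line is $P(y > 0) \cdot E[y | y > 0]$ multiplied out; the censored observations contribute zero to the expectation.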

If you want to visualize everything together in a time series style plot:

plot(y, ylim = my.range)
lines(mu, col = "slategray")
lines(y.star, col = "black")
lines(ey0, col = "green")
lines(ey, col = "blue")

The reason that the predict() method for tobit objects does not provide all of this automatically is that for distributions other than the normal (Gaussian), the relationship is not as simple. But maybe we should at least support the normal case.
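
Until then, the steps above can be wrapped in a small helper. This is only a sketch for normal errors and left-censoring at zero; `tobit_expectations` is a hypothetical name and not a function in AER. The numerical integration at the end is a sanity check of the closed-form expression.

```r
# Hypothetical helper (not part of AER): expectations for a tobit model
# with normal errors and left-censoring at zero
tobit_expectations <- function(mu, sigma) {
  z <- mu / sigma
  p_pos <- pnorm(z)                            # P(y > 0)
  ey_pos <- mu + sigma * dnorm(z) / pnorm(z)   # E[y | y > 0]
  data.frame(latent = mu, p_pos = p_pos,
             ey_pos = ey_pos, ey = p_pos * ey_pos)
}

# Sanity check: E[y] = integral of t * f(t) over t > 0, where f is the
# density of the latent y*
mu0 <- 1.5
sigma0 <- 2
ey_numeric <- integrate(function(t) t * dnorm(t, mean = mu0, sd = sigma0),
                        lower = 0, upper = Inf)$value
res <- tobit_expectations(mu0, sigma0)
all.equal(res$ey, ey_numeric, tolerance = 1e-6)
```

With the fitted model from the question, this could be called as tobit_expectations(fitted(fit), fit$scale).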

Related Solutions

Solved – How should I handle a left censored predictor variable in multiple regression

One option is to include an indicator variable that is 1 if symptom severity was not measured and 0 otherwise, and then code all the unmeasured symptom severities as 0. The coefficient on the 0/1 variable then captures the shift in the average test score for those whose severity was not measured (relative to a measured severity of 0), and the slope for severity is determined by those with measured severity.
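
A minimal sketch of this coding on made-up numbers (the data frame `d` and its column names are hypothetical). With no other covariates in the model, the intercept plus the indicator coefficient reproduces the mean score of the unmeasured group, and the severity slope equals the slope from a regression on the measured cases only.

```r
# Hypothetical data: severity is not measured (NA) for some subjects
d <- data.frame(
  score    = c(10, 12, 15, 11, 18, 9, 14, 20),
  severity = c(NA, 2, 5, NA, 8, NA, 4, 10)
)

# Indicator for "not measured", and severity with unmeasured values set to 0
d$not_measured <- as.integer(is.na(d$severity))
d$severity0 <- ifelse(is.na(d$severity), 0, d$severity)

fit_all <- lm(score ~ severity0 + not_measured, data = d)

# Comparison: regression on the measured cases only
fit_measured <- lm(score ~ severity, data = d[!is.na(d$severity), ])

coef(fit_all)
coef(fit_measured)
mean(d$score[d$not_measured == 1])
```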

Tobit Regression – How to Explain a Censored Regression Model

By default, the estimated standard deviation of the residuals ($\sigma$) is returned as $\ln(\sigma)$ since that is how the Tobit log likelihood maximization is performed. If you use coef(estResult, logSigma = FALSE), you will get $\sigma$ instead, which is analogous to the square root of the residual variance in OLS regression. That value can be compared to the standard deviation of the outcome, affairs. If it is much smaller, you may have a reasonably good model. Alternatively, you can exponentiate $\ln(\sigma)$ yourself and use the delta method for the variance. You will also need $\sigma$ to construct some of the marginal effects.

I don't think the hypothesis test about $\ln \sigma$ and the corresponding p-value have a clear interpretation, whereas the other coefficients can be interpreted as the marginal effects on the latent (uncensored) outcome, so the p-value for the null that the marginal effect is zero makes sense for them. I believe R is just treating $\ln \sigma$ as another parameter.

Here's my replication of your analysis in Stata (where I am also treating the categorical variables as continuous) confirming what I wrote above.

First we load the affairs data:

. ssc install bcuse
checking bcuse consistency and verifying not already installed...
all files already exist and are up to date.

. bcuse affairs

Contains data from http://fmwww.bc.edu/ec-p/data/wooldridge/affairs.dta
  obs:           601                          
 vars:            19                          22 May 2002 11:49
 size:        15,626                          
-------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------
id              int     %9.0g                 identifier
male            byte    %9.0g                 =1 if male
age             float   %9.0g                 in years
yrsmarr         float   %9.0g                 years married
kids            byte    %9.0g                 =1 if have kids
relig           byte    %9.0g                 5 = very relig., 4 = somewhat, 3 = slightly, 2 = not at
                                                all, 1 = anti
educ            byte    %9.0g                 years schooling
occup           byte    %9.0g                 occupation, reverse Hollingshead scale
ratemarr        byte    %9.0g                 5 = vry hap marr, 4 = hap than avg, 3 = avg, 2 = smewht
                                                unhap, 1 = vry unhap
naffairs        byte    %9.0g                 number of affairs within last year
affair          byte    %9.0g                 =1 if had at least one affair
vryhap          byte    %9.0g                 ratemarr == 5
hapavg          byte    %9.0g                 ratemarr == 4
avgmarr         byte    %9.0g                 ratemarr == 3
unhap           byte    %9.0g                 ratemarr == 2
vryrel          byte    %9.0g                 relig == 5
smerel          byte    %9.0g                 relig == 4
slghtrel        byte    %9.0g                 relig == 3
notrel          byte    %9.0g                 relig == 2
-------------------------------------------------------------------------------------------------------
Sorted by:  id

Here's the Stata equivalent of your censReg:

. tobit naffair age yrsmarr relig occup ratemarr , ll(0)

Tobit regression                                  Number of obs   =        601
                                                  LR chi2(5)      =      78.32
                                                  Prob > chi2     =     0.0000
Log likelihood = -705.57622                       Pseudo R2       =     0.0526

------------------------------------------------------------------------------
    naffairs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.1793326   .0790928    -2.27   0.024    -.3346672    -.023998
     yrsmarr |   .5541418   .1345172     4.12   0.000     .2899564    .8183273
       relig |   -1.68622   .4037495    -4.18   0.000    -2.479165   -.8932758
       occup |   .3260532   .2544235     1.28   0.201    -.1736224    .8257289
    ratemarr |  -2.284973   .4078258    -5.60   0.000    -3.085923   -1.484022
       _cons |   8.174197   2.741432     2.98   0.003     2.790155    13.55824
-------------+----------------------------------------------------------------
      /sigma |    8.24708   .5533582                      7.160311    9.333849
------------------------------------------------------------------------------
  Obs. summary:        451  left-censored observations at naffairs<=0
                       150     uncensored observations
                         0 right-censored observations

Stata reports $\sigma$ rather than $\ln \sigma$, but we can take logs too:

. nlcom logSigma: ln(_b[/sigma])

    logSigma:  ln(_b[/sigma])

------------------------------------------------------------------------------
    naffairs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    logSigma |   2.109859   .0670975    31.44   0.000     1.978351    2.241368

Note that this matches your R output. The z stat and the p-value are for the null that the log standard deviation of the residual is zero, which is definitely not the case here.
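
The nlcom result can also be reproduced by hand with the delta method. A small sketch in R, using only the /sigma row of the Stata output above: for $g(\sigma) = \ln \sigma$ the derivative is $1/\sigma$, so the standard error of $\ln \sigma$ is approximately the standard error of $\sigma$ divided by $\sigma$.

```r
# Values copied from the /sigma row of the Stata tobit output
sigma_hat <- 8.24708
se_sigma  <- 0.5533582

# Delta method for g(sigma) = log(sigma), with g'(sigma) = 1 / sigma
log_sigma    <- log(sigma_hat)
se_log_sigma <- se_sigma / sigma_hat
z_stat       <- log_sigma / se_log_sigma

round(c(logSigma = log_sigma, se = se_log_sigma, z = z_stat), 4)
```

Up to rounding, these agree with the coefficient, standard error and z statistic reported by nlcom.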

Here are the summary stats for the outcome for comparison to $\sigma$:

. sum naffairs        
        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
        naffairs |       601    1.455907    3.298758          0         12

In this case, the model looks pretty bad, which is often the case with Tobit models, especially "toy" ones meant to illustrate syntax.
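
The earlier remark that $\sigma$ is needed for some of the marginal effects can be made concrete with the estimates from the tobit output. The sketch below uses the standard formulas for a tobit model with normal errors and censoring at zero. The value of the linear index `xb` is a hypothetical choice for illustration; in practice it would be computed from your data, for example at the sample means of the regressors.

```r
# Estimates copied from the Stata tobit output
beta_age <- -0.1793326
sigma    <- 8.24708

# Hypothetical value of the linear index x'beta (an assumption for
# illustration, not taken from the data)
xb <- -5

z   <- xb / sigma
lam <- dnorm(z) / pnorm(z)                     # inverse Mills ratio

me_latent <- beta_age                          # effect on E[y*]
me_uncond <- beta_age * pnorm(z)               # effect on E[y]
me_cond   <- beta_age * (1 - lam * (z + lam))  # effect on E[y | y > 0]

c(latent = me_latent, unconditional = me_uncond, conditional = me_cond)
```

Both scale factors lie between 0 and 1, so the effects on the observed outcome are smaller in magnitude than the coefficient itself.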
