Wikipedia describes the Tobit model as follows:
$$y_i = \begin{cases} y_i^* &\text{if} \quad y_i^* > 0 \\
0 &\text{if} \quad y_i^* \le 0 \end{cases}$$
$$y_i^* = \beta x_i + u_i$$
$$u_i \sim N(0,\sigma^2)$$
I will adapt the above model to your context and offer a plain-English interpretation of the equations, which may be helpful.
$$y_i = \begin{cases} y_i^* &\text{if} \quad y_i^* \le 30 \\
30 &\text{if} \quad y_i^* > 30 \end{cases}$$
$$y_i^* = \beta x_i + u_i$$
$$u_i \sim N(0,\sigma^2)$$
In the above set of equations, $y_i^*$ represents a subject's ability. Thus, the first equation states the following:
Our measurement of ability is cut off on the higher side at 30 (i.e., we capture the ceiling effect). In other words, if a person's ability is greater than 30, then our measurement instrument fails to record the actual value and instead records 30 for that person. Note that the model states $y_i = 30 \quad \text{if} \quad y_i^* > 30$.
If, on the other hand, a person's ability is 30 or less, then our measurement instrument records the actual value. Note that the model states $y_i = y_i^* \quad \text{if} \quad y_i^* \le 30$.
We model the ability, $y_i^*$, as a linear function of our covariates $x_i$ and an associated error term to capture noise.
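To make the censoring mechanism concrete, here is a small simulation sketch of the ceiling-at-30 model; the covariate, the coefficient $\beta$, and the noise level are made up purely for illustration:

```r
set.seed(1)

n <- 1000
x <- runif(n, 0, 10)    # hypothetical covariate
beta <- 4               # hypothetical slope
sigma <- 5              # hypothetical noise standard deviation

y.star <- beta * x + rnorm(n, sd = sigma)  # latent ability
y <- pmin(y.star, 30)                      # instrument records at most 30

# Fraction of subjects hitting the ceiling
mean(y == 30)
```

Plotting `y` against `y.star` shows the recorded values tracking the latent ones below 30 and flattening out at the ceiling above it.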
I hope that is helpful. If some aspect is not clear feel free to ask in the comments.
The `predict()` and `fitted()` methods for `tobit` objects compute the estimates of the latent mean $\mu = E[y^*] = x^\top \beta$. Additionally, the scale parameter $\sigma$ is available as `$scale` in the fitted objects:
mu <- fitted(fit)
sigma <- fit$scale
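For concreteness, `fit` above might come from `AER::tobit()`; the sketch below uses the Affairs data shipped with the AER package purely as an example:

```r
library(AER)

data("Affairs", package = "AER")

# Number of affairs is left-censored at zero: many respondents report 0
fit <- tobit(affairs ~ age + yearsmarried + religiousness + occupation + rating,
             data = Affairs)
```

Any left-censored-at-zero outcome with a `tobit` fit works the same way for the computations that follow.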
The probability of a non-zero observation is then $P(y > 0) = \Phi(\mu/\sigma)$, i.e.:
p0 <- pnorm(mu/sigma)
The conditional expectation of the censored $y$ given that it is non-zero is $E[y | y > 0] = \mu + \sigma \cdot \lambda(\mu/\sigma)$, where $\lambda(\cdot)$ is the inverse Mills ratio $\lambda(x) = \phi(x)/\Phi(x)$:
lambda <- function(x) dnorm(x)/pnorm(x)
ey0 <- mu + sigma * lambda(mu/sigma)
Finally, the unconditional expectation is $E[y] = P(y > 0) \cdot E[y | y > 0]$, i.e.:
ey <- p0 * ey0
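As a sanity check, the formulas above can be verified by Monte Carlo for a single observation; the values of `mu` and `sigma` below are arbitrary:

```r
set.seed(42)

mu <- 1.2
sigma <- 2.5

# Simulate the left-censored-at-zero outcome y = max(y*, 0)
y.star <- mu + rnorm(1e6, sd = sigma)
y <- pmax(y.star, 0)

# Inverse Mills ratio
lambda <- function(x) dnorm(x) / pnorm(x)

# Unconditional expectation: E[y] = P(y > 0) * E[y | y > 0]
ey.analytic <- pnorm(mu / sigma) * (mu + sigma * lambda(mu / sigma))
ey.mc <- mean(y)

# The two should agree up to simulation error
c(ey.analytic, ey.mc)
```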
If you want to visualize everything together in a time-series-style plot (assuming `y`, `y.star`, and `my.range` are defined in your workspace):
plot(y, ylim = my.range)
lines(mu, col = "slategray")
lines(y.star, col = "black")
lines(ey0, col = "green")
lines(ey, col = "blue")
The reason that the `predict()` method for `tobit` objects does not provide all of this automatically is that for all the distributions other than the normal/Gaussian, the relationship is not that easy. But maybe we should at least support the normal case.
Best Answer
By default, the estimated standard deviation of the residuals ($\sigma$) is returned as $\ln(\sigma)$, since that is how the Tobit log-likelihood maximization is performed. If you use
coef(estResult, logSigma = FALSE)
, you will get $\sigma$ instead, which is analogous to the square root of the residual variance in OLS regression. That value can be compared to the standard deviation of `affairs`: if it is much smaller, you may have a reasonably good model. Alternatively, you can do the exponentiation yourself and use the delta method for the variance. You will also need $\sigma$ to construct some of the marginal effects.

I don't think the hypothesis test about $\ln \sigma$ and the corresponding p-value have a clear interpretation, whereas the other coefficients can be interpreted as the marginal effects on the uncensored outcome, so the p-value on the null that the marginal effect is zero makes sense for them. I believe R is just treating $\ln \sigma$ as another parameter.
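The delta-method step mentioned above can be sketched in a couple of lines; the `logSigma` estimate and its standard error below are placeholders, not values from your output:

```r
# Hypothetical estimates read off a censReg-style summary
log.sigma    <- 2.10   # placeholder coefficient for logSigma
se.log.sigma <- 0.06   # placeholder standard error of logSigma

sigma.hat <- exp(log.sigma)

# Delta method: Var(exp(t)) ~= exp(t)^2 * Var(t),
# so se(sigma) ~= sigma * se(logSigma)
se.sigma <- sigma.hat * se.log.sigma

c(sigma.hat, se.sigma)
```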
Here's my replication of your analysis in Stata (where I am also treating the categorical variables as continuous) confirming what I wrote above.
First we load the affairs data:
Here's the Stata equivalent of your censReg:

Stata reports $\sigma$ rather than $\ln \sigma$, but we can take logs too:
Note that this matches your R output. The z stat and the p-value are for the null that the log standard deviation of the residual is zero, which is definitely not the case here.
Here are the summary stats for the outcome for comparison to $\sigma$:
In this case, the model looks pretty bad, which is often the case with Tobit models, especially "toy" ones meant to illustrate syntax.