Censored vs. inflated vs. hurdle
Censored, hurdle, and inflated models work by adding a point mass on top of an existing probability density. The difference lies in where the mass is added, and how. For now, just consider adding a point mass at 0, but the concept generalizes easily to other cases.
All of them imply a two-step data generating process for some variable $Y$:
- Draw to determine whether $Y = 0$ or $Y > 0$.
- If $Y > 0$, draw to determine the value of $Y$.
Inflated and hurdle models
Both inflated (usually zero-inflated) and hurdle models work by explicitly and separately specifying a probability $1 - \pi$ that $Y$ is forced to zero, so that the DGP becomes:
- Draw once from $Z \sim \operatorname{Bernoulli}(\pi)$ to obtain realization $z$.
- If $z = 0$, set $y = z = 0$.
- If $z = 1$, draw once from $Y^* \sim D^*(\theta^*)$ and set $y = y^*$.
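The two-step process above is easy to simulate. Here is a minimal sketch (parameter values are illustrative) contrasting a zero-inflated Poisson, where $D^*$ itself puts mass at zero, with a hurdle model built on a zero-truncated Poisson, where every zero comes from the Bernoulli gate:

```python
import numpy as np

rng = np.random.default_rng(0)
n, pi_, lam = 100_000, 0.7, 2.0  # pi_ = Pr(Z = 1); illustrative values

z = rng.binomial(1, pi_, size=n)  # the Bernoulli "gate"

# Zero-inflated Poisson: D* is a plain Poisson, so Pr(Y* = 0) > 0.
y_star = rng.poisson(lam, size=n)
y_zip = np.where(z == 1, y_star, 0)

# Hurdle (zero-truncated Poisson): D* puts no mass at 0, so resample
# any zero draws of Y* until they are positive.
y_pos = rng.poisson(lam, size=n)
while (y_pos == 0).any():
    idx = y_pos == 0
    y_pos[idx] = rng.poisson(lam, size=idx.sum())
y_hurdle = np.where(z == 1, y_pos, 0)

# The inflated model has more zeros: the gate's zeros plus D*'s own.
print((y_zip == 0).mean(), (y_hurdle == 0).mean())
```

The zero frequencies should come out near $(1 - \pi) + \pi e^{-\lambda}$ for the inflated model and $1 - \pi$ for the hurdle model.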
In an inflated model, $\operatorname{Pr}(Y^* = 0) > 0$. In a hurdle model, $\operatorname{Pr}(Y^* = 0) = 0$. That's the only difference.
Both of these models lead to a density with the following form:
$$
f_D(y) = \mathbb{I}(y = 0) \cdot \left(1 - \pi\right) + \mathbb{I}(y \geq 0) \cdot \pi \cdot f_{D^*}(y)
$$
where $\mathbb{I}$ is an indicator function. That is, a point mass is added at zero, and in this case that mass is $\operatorname{Pr}(Z = 0) = 1 - \pi$. You are free to estimate $\pi$ directly, or to set $g(\pi) = X\beta$ for some invertible $g$ like the logit function. $D^*$ can also depend on $X\beta$. In that case, the model works by "layering" a logistic regression for $Z$ under another regression model for $Y^*$.
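This "layered" density can be written down directly. The sketch below evaluates it for a zero-inflated Poisson regression in which both the gate probability and the Poisson mean depend on $x\beta$; the function and parameter names (`zip_pmf`, `beta_z`, `beta_y`) are illustrative, not from any library:

```python
import numpy as np
from scipy.special import expit
from scipy.stats import poisson

def zip_pmf(y, x, beta_z, beta_y):
    """Density of Y: a point mass 1 - pi at 0, layered over pi * f_D*(y)."""
    pi = expit(x * beta_z)          # Pr(Z = 1), logit link g^{-1}
    lam = np.exp(x * beta_y)        # Poisson mean for D*, log link
    pmf = pi * poisson.pmf(y, lam)  # the D* layer, weighted by pi
    return np.where(y == 0, (1 - pi) + pmf, pmf)

# Sanity check: the density sums to 1 over the support.
x = 0.5
total = sum(zip_pmf(y, x, beta_z=1.2, beta_y=0.3) for y in range(200))
print(total)
```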
Censored models
Censored models also add mass at a boundary. They accomplish this by "cutting off" a probability distribution, and then "bunching up" the excess at that boundary. The easiest way to conceptualize these models is in terms of a latent variable $Y^* \sim D^*$ with CDF $F_{D^*}$. Then $\operatorname{Pr}(Y^* \leq y^*) = F_{D^*}(y^*)$. This is a very general model; regression is the special case in which $F_{D^*}$ depends on $X\beta$.
The observed $Y$ is then assumed to be related to $Y^*$ by:
$$
Y = \begin{cases}
0 & Y^* \leq 0 \\
Y^* & Y^* > 0
\end{cases}
$$
This implies a density of the form
$$
f_D(y) = \mathbb{I}(y = 0) \cdot F_{D^*}(0) + \mathbb{I}(y \geq 0) \cdot \left(1 - F_{D^*}(0)\right) \cdot f_{D^*}(y \mid Y^* > 0)
$$
where $f_{D^*}(y \mid Y^* > 0) = f_{D^*}(y) / \left(1 - F_{D^*}(0)\right)$ is the density of $Y^*$ conditional on clearing the boundary, so the last two factors multiply out to the unconditional $f_{D^*}(y)$. The construction extends easily to boundaries other than zero.
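The "bunched up" mass at the boundary is exactly $F_{D^*}(0)$. A quick simulation checks this, assuming a Gaussian latent variable (the Tobit setup); the parameter values are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu, sigma, n = 0.8, 1.0, 200_000  # illustrative latent-Gaussian parameters

y_star = rng.normal(mu, sigma, size=n)  # latent Y* ~ D*
y = np.maximum(y_star, 0.0)             # censor: cut off at 0, clump the excess

# The share of exact zeros should match the cut-off mass F_{D*}(0).
print((y == 0).mean(), norm.cdf(0, loc=mu, scale=sigma))
```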
Putting it together
Look at the densities:
$$\begin{align}
f_D(y) &= \mathbb{I}(y = 0) \cdot \left(1 - \pi\right) &+ &\mathbb{I}(y \geq 0) \cdot \pi &\cdot &f_{D^*}(y) \\
f_D(y) &= \mathbb{I}(y = 0) \cdot F_{D^*}(0) &+ &\mathbb{I}(y \geq 0) \cdot \left(1 - F_{D^*}(0)\right) &\cdot &f_{D^*}(y \mid Y^* > 0)
\end{align}$$
and notice that they both have the same form:
$$
\mathbb{I}(y = 0) \cdot \delta + \mathbb{I}(y \geq 0) \cdot \left(1 - \delta\right) \cdot f_{D^*}(y)
$$
because they accomplish the same goal: building the density for $Y$ by adding a point mass $\delta$ to the density for some $Y^*$. The inflated/hurdle model sets $\delta = 1 - \pi$ by way of an external Bernoulli process. The censored model determines $\delta = F_{D^*}(0)$ by "cutting off" $Y^*$ at a boundary and then "clumping" the leftover mass at that boundary.
In fact, you can always postulate a hurdle model that "looks like" a censored model. Consider a hurdle model where $D^*$ is parameterized by $\mu = X\beta$ and $Z$ is parameterized by $g(\pi) = X\beta$. Then you can just set $g = F_{D^*}^{-1}$. An inverse CDF is always a valid link function in logistic regression, and indeed one reason logistic regression is called "logistic" is that the standard logit link is actually the inverse CDF of the standard logistic distribution.
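The claim about the logit link is easy to verify numerically: `scipy.special.logit` agrees exactly with the inverse CDF (`ppf`) of the standard logistic distribution, just as the probit link is the inverse CDF of the standard normal:

```python
import numpy as np
from scipy.special import logit
from scipy.stats import logistic, norm

p = np.array([0.1, 0.5, 0.9])

print(logit(p))         # the logit link
print(logistic.ppf(p))  # inverse CDF of the standard logistic: identical
print(norm.ppf(p))      # probit link: inverse CDF of N(0, 1), for comparison
```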
You can come full circle on this idea, as well: Bernoulli regression models with any inverse CDF link (like the logit or probit) can be conceptualized as latent variable models with a threshold for observing 1 or 0. Censored regression is a special case of hurdle regression where the implied latent variable $Z^*$ is the same as $Y^*$.
Which one should you use?
If you have a compelling "censoring story," use a censored model. One classic use of the Tobit model -- the econometric name for censored Gaussian linear regression -- is modeling survey responses that are "top-coded." Wages are often reported this way: all wages above some cutoff, say 100,000, are simply recorded as 100,000. This is not the same thing as truncation, in which individuals with wages above 100,000 are not observed at all, as might happen in a survey administered only to individuals with wages under 100,000.
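The distinction between top-coding and truncation shows up directly in the data. In this sketch (wage distribution and cutoff are illustrative), censoring keeps every observation but piles them up at the cap, while truncation drops the high earners entirely:

```python
import numpy as np

rng = np.random.default_rng(2)
wages = rng.lognormal(mean=11.0, sigma=0.5, size=10_000)  # illustrative wages
cap = 100_000

top_coded = np.minimum(wages, cap)   # censoring: everyone stays in the sample
truncated = wages[wages < cap]       # truncation: high earners vanish entirely

print(len(top_coded), (top_coded == cap).sum())  # n unchanged, pile-up at cap
print(len(truncated))                            # n shrinks, no pile-up
```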
Another use for censoring, as described by whuber in the comments, is when you are taking measurements with an instrument that has limited precision. Suppose your distance-measuring device could not tell the difference between 0 and $\epsilon$. Then you could censor your distribution at $\epsilon$.
Otherwise, a hurdle or inflated model is a safe choice. It usually isn't wrong to hypothesize a general two-step data generating process, and it can offer some insight into your data that you might not have had otherwise.
On the other hand, you can use a censored model without a censoring story to create the same effect as a hurdle model without having to specify a separate "on/off" process. This is the approach of Sigrist and Stahel (2010), who censor a shifted gamma distribution just as a way to model data in $[0, 1]$. That paper is particularly interesting because it demonstrates how modular these models are: you can actually zero-inflate a censored model (section 3.3), or you can extend the "latent variable story" to several overlapping latent variables (section 3.1).
Truncation
Edit: removed, because this solution was incorrect
Best Answer
Wikipedia describes the Tobit model as follows:
$$y_i = \begin{cases} y_i^* &\text{if} \quad y_i^* > 0 \\ 0 &\text{if} \quad y_i^* \le 0 \end{cases}$$
$$y_i^* = \beta x_i + u_i$$
$$u_i \sim N(0,\sigma^2)$$
I will adapt the above model to your context and offer a plain-English interpretation of the equations, which may be helpful.
$$y_i = \begin{cases} y_i^* &\text{if} \quad y_i^* \le 30 \\ 30 &\text{if} \quad y_i^* > 30 \end{cases}$$
$$y_i^* = \beta x_i + u_i$$
$$u_i \sim N(0,\sigma^2)$$
In the above set of equations, $y_i^*$ represents a subject's ability. Thus, the first equation states the following:
Our measurement of ability is cut off on the higher side at 30 (i.e., we capture the ceiling effect). In other words, if a person's ability is greater than 30, then our measurement instrument fails to record the actual value and instead records 30 for that person. Note that the model states $y_i = 30 \quad \text{if} \quad y_i^* > 30$.
If, on the other hand, a person's ability is at most 30, then our measurement instrument is capable of recording the actual value. Note that the model states $y_i = y_i^* \quad \text{if} \quad y_i^* \le 30$.
We model the ability, $y_i^*$, as a linear function of our covariates $x_i$ and an associated error term to capture noise.
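The whole ceiling-effect setup can be simulated in a few lines. Everything here is illustrative (the covariate, $\beta$, and $\sigma$ are made up); the point is that every subject whose latent ability exceeds 30 gets recorded as exactly 30:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
x = rng.uniform(0, 10, size=n)   # illustrative covariate
beta, sigma = 3.0, 4.0           # illustrative coefficients

ability = beta * x + rng.normal(0, sigma, size=n)  # latent y_i^*
recorded = np.minimum(ability, 30)                 # instrument tops out at 30

# Share of subjects hitting the ceiling:
print((recorded == 30).mean())
```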
I hope that is helpful. If some aspect is not clear feel free to ask in the comments.