Solved – Tobit versus OLS

censoring, econometrics, tobit-regression

There is a dependent variable which is measured in £ and can take values from £0 to £100,000. It is effectively the value of the payment made. If it takes the value £0, it means a payment was not made (because it wasn't authorised). If it takes any value between £0.01 and £100,000, it means the payment was authorised for that amount.

We are interested in looking at how the value of the payment varies according to other information we hold (independent variables).

One of the proposals being put forward is to use a Tobit model with a lower limit of 0. I haven't worked explicitly with a Tobit model before but I can't see why it is appropriate. Payments cannot be negative. There is no censoring at 0 – those points at 0 are simply those where payments haven't been made.

My intuition is to remove all of the observations where the payment made is £0 and simply to run OLS on the observations where the dependent variable is £0.01 – £100,000.

Is the Tobit approach justifiable? Is it as simple as running OLS on a truncated data set?

Edit – To clarify – my concern on whether the Tobit model is appropriate rests on my confusion as to whether the £0 responses are 'corner solutions.' I appreciate, as an unrelated example, that someone can decide to provide no donation (£0) or a donation (>£0). This is a decision made with respect to their preferences/views. However, a £0 in my scenario is simply an entry to represent the fact that a payment has not been made.

Edit2 – To clarify further:

There are a lot of requests for payments which come to us. These all take the form of £0.01 – £100,000. To be paid out, they have to be authorised. Only a subset of the requests are authorised. Those we do not authorise have a payment made of £0.

So we have 2 fields:

  1. Payment requested – these will also take the form of £0.01 – £100,000. No one will ever request a payment of £0 as it would be illogical.
  2. Payment made – these will take the form of £0.01 – £100,000 for those payments we did authorise and the form of £0 for those payments we did not authorise. This variable is the focus of our analysis as we want to understand how the value of the payment made varies according to information we hold about the request (i.e. who requested it, what department are they in, what was the request for).

Payments are requested but not authorised because they do not meet eligibility criteria we have in the business. To give a somewhat-related example: someone can make a request for payment of £50 to reimburse an expense they incurred. However, this would fall outside our eligibility criteria and therefore we would not authorise the payment and the payment made would be £0.

This suggests to me the following:

£0 entries in the payment made field are due to our eligibility criteria as a business.
£0.01 – £100,000 entries in the payment made field are due to information we hold about the payment itself.

The typical example I see for 'corner solutions' is charitable giving by an individual. Here, giving £0 and giving >£0 are both decisions made with respect to the characteristics of the individual.

I am curious thus if the 'corner solution' approach still applies to my scenario. The characteristics which determine the size of the payment made and the characteristics which determine whether we authorise payments are not necessarily the same characteristics.

Best Answer

In this case you are ignoring that a payment between £0.01 and £100,000 is not the only choice: not paying, i.e. £0, is a choice in itself. These are two separate selection mechanisms, of which the first is continuous (the value of the payment) and the second is discrete (the yes/no decision of paying at all). You can represent this by assuming that both processes are driven by the same continuous latent variable $y_i^*$ that depends linearly on your explanatory variables $x_i$, $$y_i^* = x_i'\beta + \epsilon_i$$ with $\epsilon_i \sim N(0,\sigma^2)$ iid errors. The variable you observe is censored below at zero, $$ y_i = \begin{cases} y_i^* &\text{if}\; y_i^* > 0 \\ 0 &\text{if}\; y_i^* \leq 0 \end{cases} $$ where the zero case is generally referred to as a corner solution. This variable is a mixture of the aforementioned process of continuous payment values above zero, with density $$f(y_i|x_i)=\frac{1}{\sigma} \phi\left(\frac{y_i - x_i'\beta}{\sigma}\right),$$ and the discrete choice process of paying or not, $$ P(y_i=0|x_i) = P(y_i^*\leq 0|x_i) = \Phi\left(\frac{-x_i'\beta}{\sigma}\right) $$
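A small simulation makes the censoring mechanism concrete. The parameters below ($\beta_0$, $\beta_1$, $\sigma$ and the distribution of $x$) are illustrative assumptions, not taken from the question; the point is only that the empirical share of zeros matches the model-implied probability $\Phi(-x_i'\beta/\sigma)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 200_000
beta0, beta1, sigma = 1.0, 2.0, 1.5   # illustrative parameters, not from the question
x = rng.uniform(-1, 1, n)

# Latent payment value y* = x'beta + eps, eps ~ N(0, sigma^2)
y_star = beta0 + beta1 * x + rng.normal(0, sigma, n)

# Observed payment: censored below at zero (the "corner solution")
y = np.maximum(y_star, 0.0)

# Empirical share of zeros vs the model-implied probability Phi(-x'beta/sigma)
empirical = (y == 0).mean()
implied = norm.cdf(-(beta0 + beta1 * x) / sigma).mean()
print(round(empirical, 3), round(implied, 3))
```

The two printed numbers agree closely, which is exactly the mixture structure described above: the zeros are generated by the same latent process as the positive values.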

In your regression model you will not estimate the partial effect $\beta_k$ of a given explanatory variable $x_k$ (that is the effect on the latent variable $y_i^*$); instead, the effect on the observed $y_i$ is $$ \frac{\partial E(y_i|x_i)}{\partial x_{ik}} = \beta_k \Phi\left(\frac{x_i'\beta}{\sigma}\right) $$

If you are interested in how to arrive at this result, take the mixture distribution of $y_i$ and obtain the conditional expectation $E(y_i|x_i)$ - this expression is actually included below if you read on. Then differentiate it with respect to $x_{ik}$; the terms involving $\phi$ cancel and you are left with $\beta_k \Phi(x_i'\beta/\sigma)$. The derivation is also shown in econometrics textbooks like Wooldridge (2010).

McDonald and Moffitt (1980) have shown that this partial effect can be decomposed as $$ \frac{\partial E(y_i|x_i)}{\partial x_{ik}} = \frac{\partial E(y_i|y_i>0,x_i)}{\partial x_{ik}}P(y_i>0|x_i) + \frac{\partial P(y_i>0|x_i)}{\partial x_{ik}}E(y_i|y_i>0,x_i) $$ which is the effect on the conditional expectation of fully observed values plus the effect on the probability of being fully observed. If you purposefully truncate your data by excluding all zero observations from the regression, you are leaving the second part of this partial effect in the error term, which results in an endogeneity problem that will bias your results.
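The decomposition can be checked numerically at a single evaluation point. Under normal errors the standard Tobit formulas give $E(y_i|y_i>0,x_i)=\sigma(z+\lambda(z))$ and $\partial E(y_i|y_i>0,x_i)/\partial x_{ik}=\beta_k[1-\lambda(z)(z+\lambda(z))]$ with $z=x_i'\beta/\sigma$ and $\lambda$ the inverse Mills ratio; the values of $\beta_k$, $\sigma$, and $z$ below are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical values at which to check the McDonald-Moffitt identity
beta_k, sigma = 2.0, 1.5
z = 0.8                                  # z = x'beta / sigma at the evaluation point

lam = norm.pdf(z) / norm.cdf(z)          # inverse Mills ratio lambda(z)

# Effect on E(y | y > 0, x):  beta_k * (1 - lam * (z + lam))
cond_effect = beta_k * (1 - lam * (z + lam))
# Effect on P(y > 0 | x):     (beta_k / sigma) * phi(z)
prob_effect = (beta_k / sigma) * norm.pdf(z)

p_pos = norm.cdf(z)                      # P(y > 0 | x)
e_pos = sigma * (z + lam)                # E(y | y > 0, x)

decomposed = cond_effect * p_pos + prob_effect * e_pos
direct = beta_k * norm.cdf(z)            # beta_k * Phi(x'beta / sigma)
print(round(decomposed, 6), round(direct, 6))
```

The two printed numbers coincide, confirming that the two channels sum exactly to the overall partial effect $\beta_k\Phi(x_i'\beta/\sigma)$.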

If you do not exclude the zero cases from your data, your regression will still be biased because when you run the regression $$y_i = x_i'\beta + u_i$$ the conditional expectation $E(y_i|x_i) = x_i'\beta\Phi\left(\frac{x_i'\beta}{\sigma}\right) + \sigma \phi \left(\frac{x_i'\beta}{\sigma}\right)$ is not a linear function of $x_i$.
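Both biases are easy to see in a simulation (the true coefficients below are illustrative assumptions, not estimates from the question's data). OLS on the full censored sample and OLS on the truncated positive-only sample both return a slope attenuated toward zero relative to the true value:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta0, beta1, sigma = 1.0, 2.0, 1.5   # illustrative "true" parameters
x = rng.uniform(-1, 1, n)
y = np.maximum(beta0 + beta1 * x + rng.normal(0, sigma, n), 0.0)

def ols_slope(x, y):
    """Slope coefficient from a least-squares fit with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

slope_all = ols_slope(x, y)              # zeros kept in the sample (censored OLS)
pos = y > 0
slope_trunc = ols_slope(x[pos], y[pos])  # zeros dropped (truncated OLS)
print(round(slope_all, 2), round(slope_trunc, 2))
# Both estimates fall well below the true slope beta1 = 2
```

Neither variant recovers $\beta_1 = 2$, which is the practical content of the two bias arguments above: dropping the zeros does not rescue OLS, it just changes the form of the bias.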

The Tobit model solves this by maximising the log-likelihood of the mixture variable $y_i$. Note that the Tobit model is only a solution to this problem if the error term is normally distributed and homoscedastic. However, there are also semiparametric methods that can deal with this case if normality of the residuals is not a credible assumption.
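As a minimal sketch of that maximum-likelihood step (simulated data, illustrative parameters; statsmodels has no built-in Tobit, so the likelihood is coded directly from the two pieces above: $\Phi(-x_i'\beta/\sigma)$ for the zeros and the normal density for the positive values):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 5_000
beta_true = np.array([1.0, 2.0])      # illustrative truth, not from the question
sigma_true = 1.5
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
y = np.maximum(X @ beta_true + rng.normal(0, sigma_true, n), 0.0)

def neg_loglik(params):
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)         # parameterise log(sigma) to keep sigma > 0
    xb = X @ beta
    # Censored observations contribute log Phi(-x'beta/sigma);
    # uncensored ones contribute the log normal density of (y - x'beta)/sigma
    ll = np.where(y == 0,
                  norm.logcdf(-xb / sigma),
                  norm.logpdf(y, loc=xb, scale=sigma))
    return -ll.sum()

start = np.array([0.0, 0.0, np.log(y.std())])
res = minimize(neg_loglik, start, method="BFGS")
beta_hat, sigma_hat = res.x[:2], np.exp(res.x[-1])
print(np.round(beta_hat, 2), round(sigma_hat, 2))
```

The estimates land close to the true $(\beta, \sigma)$ despite roughly a quarter to a third of the sample being censored at zero, which is exactly what OLS in either form above fails to do.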

If you would like to know more about the Tobit model I would recommend the corresponding book chapter in Wooldridge (2010). Otherwise there are also some excellent lecture slides out there like Blundell (2014).
