Solved – Using PROC QLIM OUTPUT to get predicted values from a two-step Tobit model

Tags: prediction, sas, tobit-regression

I am fitting a two-step Tobit model through PROC QLIM in SAS. The first step of the model is a probit model for whether someone "responds" (e.g. makes a donation). The second step of the model is linear for the amount (e.g. amount of the donation, given that someone made a donation). I am using a two-step Tobit model rather than a Tobit-1 model because in my actual data I suspect some selection bias in terms of who responds, and also because I may want to use different covariates for each step (presently, I am using the same ones).

Since PROC QLIM does not appear to support predict or score statements, I created dummy data in mydata by appending a copy of my dataset with the outcomes (response and amount) removed, modifying the covariates in such a way that I can get predictions for a hypothetical dataset where test=0 throughout. Here is a sample of my code:

proc qlim data=mydata;
class test classvar1 classvar2 classvar3;
model response = test classvar1 classvar2 classvar3 test*classvar1 test*classvar2 test*classvar3 / discrete;
model amount = test classvar1 classvar2 classvar3 test*classvar1 test*classvar2 test*classvar3 / select(response=1);
output out=tempout conditional expected predicted prob mills;
run;

mydata has the following relevant fields:

response: 0/1 indicator of donation
amount: continuous value indicating the amount of donation; missing if response=0
test: 0/1 indicator of whether individual is in test or control group
classvar1 – classvar3: various categorical characteristics of individuals

What I am trying to get out of this is a predicted value that reflects each individual's expected donation amount, unconditional on whether they donated (so, the predicted value should include that probability of donation in some way). However, in the predicted values, I get only the following metrics related to amount:

P_amount (Predicted value of amount)
Expct_amount (Unconditional expected value of amount)

I do not get a "conditional" expected value of amount at all — instead, the P_amount and Expct_amount values above are equivalent to what I would expect the conditional expected value to be (and they are also equal to the Xbeta values for the amount model). In other words, in those predicted values, there does not appear to be any adjustment for the probability of response.

For other PROC QLIM models, such as a simple one-equation Tobit-1 model, I have seen both the conditional and unconditional expected values appear in output, and they differ from each other (i.e. the unconditional values are usually smaller, in some way related to the probability of response). Is there something I'm not specifying correctly that is causing me to get this output? The only clue I found in the logs is this:

Note: The Mills Ratio is not calculated for an ordinal discrete variable
or continuous variable without censoring or truncation

Happy to clarify further if needed. Thank you!

Best Answer

After consulting with SAS Support, it appears that PROC QLIM does not provide direct output to answer my question in the case of selection models like this one. However, other outputs can be used to compute what I seek. Here is the solution in case anyone else has this question.

In the output statement, I can request prob, xbeta, and mills:

output out=tempout xbeta mills prob;

And then, in a separate data step, I can calculate the following:

probability_of_response = (1-prob_response)*(response=0) + (prob_response)*(response=1);
amount_given_a_response = xbeta_amount + mills_response * SIGMA * RHO;

Where:

SIGMA is the numeric value of _Sigma.amount from my model results
RHO is the numeric value of _Rho from my model results

Note that prob_response (output from PROC QLIM) is the "probability that the response equals this record's actual response," so it has to be inverted for non-responders in order to get the actual probability of responding. The second calculation follows the econometric formula:

E(y_i | z_i=1) = x_i * B + rho * sigma * (Mills ratio)

After creating these variables, I can simply multiply everyone's probability_of_response by their amount_given_a_response to get everyone's expected donation, accounting for their probability of donating rather than conditional on donating.
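
As a rough check of the arithmetic described above, here is a minimal sketch in Python (used instead of SAS only so that it runs standalone). The two records and the SIGMA and RHO values are made up for illustration; the field names simply mirror the prob_response, xbeta_amount and mills_response columns mentioned above, and expected_donation is a hypothetical helper, not part of PROC QLIM.

```python
def expected_donation(response, prob_response, xbeta_amount, mills_response,
                      sigma, rho):
    """Expected amount for one record, not conditional on responding."""
    # prob_response is the probability of the record's *observed* outcome,
    # so it is inverted for non-responders to obtain P(response = 1).
    p_respond = prob_response if response == 1 else 1.0 - prob_response
    # E(amount | response = 1) = xbeta + rho * sigma * (Mills ratio)
    amount_given_response = xbeta_amount + mills_response * sigma * rho
    return p_respond * amount_given_response


# Made-up stand-ins for _Sigma.amount and _Rho from the model results.
SIGMA = 20.0
RHO = 0.5

# Two hypothetical records with identical covariates: one responder and one
# non-responder. Both should end up with the same expected donation.
records = [
    {"response": 1, "prob_response": 0.25, "xbeta_amount": 40.0, "mills_response": 1.25},
    {"response": 0, "prob_response": 0.75, "xbeta_amount": 40.0, "mills_response": 1.25},
]

expected = [expected_donation(sigma=SIGMA, rho=RHO, **rec) for rec in records]
print(expected)
```

Both records get the same value because, after the inversion for the non-responder, each has a 0.25 probability of responding and the same conditional amount.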

Related Solutions

Solved – Can value of predicted probability from logistic model be greater than one

The OP has explained in comments that they mistakenly called the R glm function without specifying the argument family=binomial, and so used the default gaussian family (with identity link). That is ordinary (least squares) linear regression, not logistic regression, and it can clearly give predictions outside the interval $(0,1)$.
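
To illustrate the point with numbers, here is a small pure-Python sketch (not R's glm, and with made-up data): a least-squares line fitted to a 0/1 outcome produces fitted values below 0 and above 1, whereas any linear predictor passed through the inverse logit stays strictly inside $(0,1)$. The eta values are arbitrary and chosen only for demonstration.

```python
import math

# Made-up binary outcome that switches from 0 to 1 halfway along x.
x = list(range(10))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Closed-form simple least-squares fit (what a gaussian/identity fit amounts to).
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
linear_preds = [intercept + slope * xi for xi in x]


def inv_logit(eta):
    """Inverse of the logit link used by logistic regression."""
    return 1.0 / (1.0 + math.exp(-eta))


# Arbitrary linear-predictor values, mapped to probabilities by the logit link.
logistic_probs = [inv_logit(eta) for eta in (-8.0, -1.0, 0.0, 3.0, 12.0)]

print(min(linear_preds), max(linear_preds))      # falls outside [0, 1]
print(min(logistic_probs), max(logistic_probs))  # stays inside (0, 1)
```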

Solved – Why I am that unsuccessful with predicting with Generalized Lasso (genlasso {genlasso})

There are two issues:

glmnet includes an intercept by default while genlasso does not. To disable the intercept in glmnet, use intercept=FALSE, although this reduces its predictive performance significantly here. To instead add an intercept in genlasso, cbind a column of ones onto your X matrix and change D accordingly. If you choose to penalize the intercept, you might need to set the corresponding value in the (p+1)-by-(p+1) D matrix more carefully than in the example code below. Setting penalize.intercept to FALSE in the code below leaves the intercept unpenalized by using a p-by-(p+1) penalty matrix D, which is an identity matrix with the row corresponding to the intercept removed.
The lambda values for glmnet and genlasso are on different scales. glmnet optimizes $\frac{1}{2N}\sum_{i=1}^N(y_i-\beta_0-x_i^T\beta)^2+\lambda\left[\frac{1-\alpha}{2}\left\|\beta\right\|_2^2+\alpha\left\|\beta\right\|_1\right]$, while genlasso optimizes $\frac12\left\|y-X\beta\right\|_2^2+\lambda\left\|D\beta\right\|_1$. Both use the $L^1$ norm for the lasso penalty, but glmnet's error term is the squared $L^2$ norm of the residuals divided by $2N$, while genlasso's is divided by just $2$. The $\lambda$ values from one must therefore be rescaled before being applied to the other. The same issue arises when using the CV'd $\lambda$ values to fit a model with a significantly different $N$, which is ignored in the code below. We can either scale lambda.cv to the appropriate value for genlasso, or use cross-validation to select an appropriate value directly. The code below takes the latter approach. Note that glmnet provides two methods of selecting the lambda value from cross-validation: the one giving the minimum CV loss (lambda.min), and one from the 1-standard-error rule, which adds some more regularization in exchange for a slightly higher CV loss (lambda.1se). The 1se rule produces better out-of-sample performance in this case.
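
As a sketch of what the rescaling looks like, assume the simplest case: pure lasso ($\alpha=1$), $D$ equal to the identity, no intercept, and no internal standardization of the data by glmnet (which its defaults may in fact apply). Multiplying glmnet's objective by $N$ does not change its minimizer, and gives

$$
\begin{aligned}
N\left[\frac{1}{2N}\sum_{i=1}^N\left(y_i-x_i^T\beta\right)^2+\lambda_{\text{glmnet}}\left\|\beta\right\|_1\right]
&=\frac12\left\|y-X\beta\right\|_2^2+N\lambda_{\text{glmnet}}\left\|\beta\right\|_1,\\
\text{so}\quad\lambda_{\text{genlasso}}&=N\,\lambda_{\text{glmnet}}.
\end{aligned}
$$

Under these assumptions the two penalties differ by a factor of $N$, the number of observations used in the fit. With standardization or a non-identity $D$, the correspondence is not this simple.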

Code for glmnet without intercept, genlasso with intercept and CV-selected lambda:

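library(glmnet)   # provides cv.glmnet
library(genlasso) # provides genlasso
## sim.sample.x, sim.sample.y, train.idc and test.idc are assumed to exist already.
set.seed(42) # set RNG seed for reproducibility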
n.train <- sum(train.idc) # number of training points
n.folds <- 10L # number of CV folds
foldid <- sample(rep_len(seq.int(n.folds), n.train)) # fold number for each training point
## (creates folds with roughly the same number of points)

train.x <- sim.sample.x[train.idc, ]
train.y <- sim.sample.y[train.idc]
test.x <- sim.sample.x[test.idc, ]
test.y <- sim.sample.y[test.idc]

## Call cv.glmnet specifying the foldid's:
cv.glmnet.fit <- cv.glmnet(x = train.x, 
                           y = train.y, 
                           foldid = foldid,
                           type.measure = "mse", alpha = 1)
## Choose the lambda with the minimum CV-estimated loss:
cv.glmnet.lambda.min.pred <- predict(cv.glmnet.fit,
                                     s = cv.glmnet.fit$lambda.min,
                                     newx = test.x)
## Using the 1-standard-error lambda selection rule:
cv.glmnet.lambda.1se.pred <- predict(cv.glmnet.fit,
                                     s = cv.glmnet.fit$lambda.1se,
                                     newx = test.x)
## with no intercept:
cv.glmnet0.fit <- cv.glmnet(x = train.x, 
                            y = train.y, 
                            foldid = foldid,
                            type.measure = "mse", alpha = 1,
                            intercept=FALSE)
cv.glmnet0.lambda.min.pred <- predict(cv.glmnet0.fit,
                                      s = cv.glmnet0.fit$lambda.min,
                                      newx = test.x)
cv.glmnet0.lambda.1se.pred <- predict(cv.glmnet0.fit,
                                      s = cv.glmnet0.fit$lambda.1se,
                                      newx = test.x)


## Get lambda sequence for genlasso on all training data:
p <- ncol(train.x) # number of predictors, excluding the added intercept column
D <- diag(1, p+1)
penalize.intercept <- FALSE
if (!penalize.intercept) D <- D[-1,]
genlasso.fit <- genlasso(y = train.y, 
                         X = cbind(1,train.x), 
                         D = D)
## Evaluate each lambda on each fold:
fold.lambda.losses <- tapply(seq_along(foldid), foldid, function(fold.indices) {
  fold.genlasso.fit <- genlasso(y = train.y[-fold.indices],
                                X = cbind(1,train.x[-fold.indices,]),
                                D = D)
  ## length(fold.indices)-by-length(genlasso.fit$lambda) matrix, with
  ## predictions for this fold:
  fold.genlasso.preds <- predict(fold.genlasso.fit,
                                 lambda = genlasso.fit$lambda,
                                 Xnew = cbind(1, train.x[fold.indices,]))$fit
  lambda.losses <- colMeans((fold.genlasso.preds - train.y[fold.indices])^2)
  return (lambda.losses)
})
## CV loss for each lambda:
cv.lambda.losses <- colMeans(do.call(rbind, fold.lambda.losses))
cv.genlasso.lambda.min <- genlasso.fit$lambda[which.min(cv.lambda.losses)]
## Caution, this lambda may need rescaling based on the ratio of the full training set size to the fold training set sizes.

## Predict:
cv.genlasso.lambda.min.pred <- predict(genlasso.fit,
                                       lambda = cv.genlasso.lambda.min,
                                       Xnew = cbind(1,test.x))$fit

summary(as.vector(cv.glmnet.lambda.min.pred))
summary(as.vector(cv.glmnet.lambda.1se.pred))
summary(as.vector(cv.glmnet0.lambda.min.pred))
summary(as.vector(cv.glmnet0.lambda.1se.pred))
summary(as.vector(cv.genlasso.lambda.min.pred))

mean((cv.glmnet.lambda.min.pred-test.y)^2)
mean((cv.glmnet.lambda.1se.pred-test.y)^2)
mean((cv.glmnet0.lambda.min.pred-test.y)^2)
mean((cv.glmnet0.lambda.1se.pred-test.y)^2)
mean((cv.genlasso.lambda.min.pred-test.y)^2)
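
For readers unfamiliar with the two selection rules mentioned above, here is a minimal Python sketch of how a minimum-loss lambda and a 1-standard-error lambda can be picked from per-fold CV losses. The lambda grid and loss values are made up, and the standard error is computed as the sample standard deviation across folds divided by the square root of the number of folds; this is an assumption for illustration, not a reproduction of glmnet's internals.

```python
import math
import statistics

# Hypothetical lambda grid (largest first) and per-fold CV losses.
lambdas = [1.0, 0.5, 0.25, 0.125, 0.0625]
fold_losses = [  # one row per lambda, one column per fold
    [4.0, 4.4, 3.6, 4.0],
    [2.0, 2.4, 1.6, 2.0],
    [1.05, 1.25, 0.85, 1.05],
    [1.0, 1.2, 0.8, 1.0],
    [1.1, 1.3, 0.9, 1.1],
]

# Mean CV loss and its standard error for each lambda.
cv_mean = [statistics.mean(row) for row in fold_losses]
cv_se = [statistics.stdev(row) / math.sqrt(len(row)) for row in fold_losses]

# Rule 1: the lambda with the smallest mean CV loss.
best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
lambda_min = lambdas[best]

# Rule 2: the largest (most regularizing) lambda whose mean CV loss is
# within one standard error of the minimum.
threshold = cv_mean[best] + cv_se[best]
lambda_1se = max(lam for lam, m in zip(lambdas, cv_mean) if m <= threshold)

print(lambda_min, lambda_1se)
```

With these made-up losses the 1-standard-error rule picks a larger lambda than the minimum-loss rule, trading a slightly higher CV loss for more regularization.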
