Solved – Predict future student outcomes (binary and continuous) with historic cross-sectional data

Tags: logistic, predictive-models, probability, regression, stata

Using Stata 11.2, I would like to develop 2 analytic models that could be implemented by school administrators to flag students for intervention. I'm wondering if it would be possible to develop these models based on historical cross-sectional data consisting of 650,000 unique observations (11th grade students); it was collected from 2009 through 2011.

First, I want to predict risk of dropping out of school (with the goal of intervening in 5% of students with the highest risk of dropout). Second, I want to predict absence (with the goal of intervening among students in the top 5% of predicted absentee time).

Outcome variables: dropout (binary; 0=no, 1=yes) and absence (continuous; cumulative hours absent from school)

Predictor variables: sex (binary), track (3-categories), GPA (continuous)

So far, I've just worked on predicting dropout, but I am not sure if what I'm doing is theoretically correct or statistically sound. (It's been a long time since I took stats!) Here is what I've done, using the Stata command:

logistic dropout i.sex i.track gpa

The output shows all independent variables are significantly associated with dropout, based on p<0.05 and 95% CIs for the ORs that do not contain 1. Sex (OR=0.95), Track (OR=0.76), GPA (OR=1.52).

Now I'm not sure how to proceed in calculating the predicted probabilities of dropout, and then in figuring out which students are at greatest risk of dropout. Should the command be predict phat?

I think this gives the predicted prob of dropout for each level of each variable, holding all other variables at their mean. Then I would just categorize the predicted values into 20 categories, and the students in the top category would be those who should be targeted for intervention?

I would greatly appreciate recommendations on how to conduct both analyses (binary and continuous outcomes).

Best Answer

You have to combine, or rather stack, the historic data and the current data (the students for whom you want predictions). In Stata, this can be done with the append command.

Then estimate the models with the logit command (binary outcome) and the regress command (continuous outcome). Because the outcome is missing for the current students, only the historic observations enter the estimation.

In the last step, use predict to obtain the out-of-sample predictions for the outcome variable. After a logit model, predict gives you the predicted probability of a positive outcome for each observation, given that observation's values of the covariates.
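Applied to the data in the question, the three steps above might look roughly as follows. This is only a sketch: the file names historic.dta and current.dta and the indicator newdata are hypothetical, and it assumes both files use the variable names from the question (dropout, absence, sex, track, gpa), with dropout and absence missing in the current file.

```stata
* Hypothetical file names; both files are assumed to share the same variables
use historic, clear
* Mark the historic rows before stacking
generate byte newdata = 0
append using current
* Rows that came from the appended file have newdata missing; mark them as new
replace newdata = 1 if missing(newdata)

* Estimation uses only rows with a non-missing outcome, i.e. the historic data
logit dropout i.sex i.track gpa
* Predicted probability of dropout for every row, including the new students
predict p_dropout, pr

regress absence i.sex i.track gpa
* Predicted hours absent for every row, including the new students
predict abs_hat, xb

* Inspect the out-of-sample predictions
summarize p_dropout abs_hat if newdata == 1
```

By default predict fills in a value for every observation with non-missing covariates, whether or not that observation was used in estimation, which is what makes the stacking approach work.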

Note that your understanding of the predict command is not quite correct. What you describe is closer to what the margins command does.
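To make the difference concrete, here is a minimal sketch using the model from the question. It assumes the question's data are in memory; the variable name phat is the one proposed in the question.

```stata
* Assumes the question's data (dropout, sex, track, gpa) are in memory
logit dropout i.sex i.track gpa

* predict: one fitted probability per student,
* computed from that student's own sex, track and gpa
predict phat, pr

* margins: one fitted probability per level of track,
* with the other covariates held at their sample means
margins track, atmeans
```

So predict produces a new variable with as many values as there are students, whereas margins reports a small table of summary predictions. Only the former can be used to rank individual students.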

Example

I use a dataset from William Greene's Econometric Analysis textbook, available on the web:

clear
infile obs gpa tuce psi grade ///
    using "http://people.stern.nyu.edu/wgreene/Text/Edition7/TableF14-1.txt" 

I want to predict the probability of a grade increase. For the sake of illustration, I assume that the first 20 observations of the dataset are the historical data and that the remaining 12 observations are the ones for which you want a prediction. Hence, I set grade to missing for observations 21 to 32.

replace grade = . if _n > 20

I estimate a logistic regression model that predicts the probability of a grade increase (grade = 1) as a function of the gpa and tuce scores.

logit grade gpa tuce

Note that I have used the logit rather than the logistic command, because I want to reuse the coefficients in a moment. Now I can compute the predicted probabilities for all the observations.

predict phat1, p

To check what predict does, I compute the same quantity by hand from the stored coefficients.

local X "_b[_cons] + _b[gpa]*gpa + _b[tuce]*tuce"
generate phat2 = exp(`X') /(1 + exp(`X'))

I could have directly used Stata's invlogit function:

generate phat3 = invlogit(`X')

All three methods yield the same results.

summarize phat1 phat2 phat3

With a continuous outcome and the regress command, the principle is the same. Here is an illustration.

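* Treat observations 21 to 32 as the ones to be predicted
replace gpa = . if _n > 20
regress gpa tuce psi
* Linear prediction for all observations, including 21 to 32
predict xbhat1, xb
* The same prediction computed by hand from the stored coefficients
gen xbhat2 = _b[_cons] + _b[tuce] * tuce + _b[psi] * psi
summarize xbhat1 xbhat2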
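
The question also asks how to pick out the 5% of students with the highest predictions. One possible way, sketched here as a continuation of the example, is to compute the 95th percentile of the predictions among the observations to be scored and flag everything at or above it. The variable names flag_grade and flag_gpa are hypothetical, and the code assumes phat1 and xbhat1 from the example already exist.

```stata
* Sketch only: continues from the example (phat1 and xbhat1 already exist).
* Observations 21 to 32 play the role of the students to be scored.

* 95th percentile of the predicted probability among the scored observations
_pctile phat1 if _n > 20, percentiles(95)
generate byte flag_grade = (phat1 >= r(r1)) if _n > 20 & !missing(phat1)

* Same idea for the continuous outcome
_pctile xbhat1 if _n > 20, percentiles(95)
generate byte flag_gpa = (xbhat1 >= r(r1)) if _n > 20 & !missing(xbhat1)

list obs phat1 flag_grade xbhat1 flag_gpa if _n > 20
```

With only 12 scored observations a 5% cutoff is not very meaningful; the approach is intended for a dataset of the size described in the question. The 20-category idea from the question could alternatively be implemented with xtile and its nquantiles(20) option, flagging the top category.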