You have to combine, or rather stack, the historical data and the current data. In Stata, this can be done with the append command. Then you use the logit and regress commands to estimate the models. In a last step, you use predict to obtain the out-of-sample predictions for the outcome variable. In the case of a logit model, predict gives you the predicted probability of observing the outcome, given a set of values for the covariates.
Note that your understanding of the predict command is not quite correct. What you describe corresponds rather to the margins command.
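A minimal sketch of that workflow (the dataset and variable names here are hypothetical, not from your data):
* historical.dta contains y, x1, x2; current.dta contains x1, x2 with y missing
use historical, clear
append using current
logit y x1 x2
predict yhat, pr
* margins, at(x1=1 x2=10)   // what you described: the probability at chosen covariate values
For a continuous outcome you would estimate with regress instead of logit, and predict would then return the linear prediction xb.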
Example
I use a dataset from William Greene's Econometric Analysis textbook, available on the web:
clear
infile obs gpa tuce psi grade ///
using "http://people.stern.nyu.edu/wgreene/Text/Edition7/TableF14-1.txt"
I want to predict the probability of a grade increase. For the sake of illustration, I assume that the first 20 observations of the dataset are the historical data, and that the remaining 12 observations are the ones for which you want a prediction. Hence, I set the outcome for observations 21 to 32 to missing.
replace grade = . if _n > 20
I estimate a logistic regression model that predicts the probability of a grade increase (grade = 1) as a function of the gpa and tuce scores.
logit grade gpa tuce
Note that I have used the logit rather than the logistic command, because I want to reuse the coefficients in a moment. Now I can compute the predicted probabilities for all the observations.
predict phat1, p
To understand, or to be sure of, what predict does, I compute the following.
local X "_b[_cons] + _b[gpa]*gpa + _b[tuce]*tuce"
generate phat2 = exp(`X') /(1 + exp(`X'))
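In formula terms, this is just the inverse logit transform applied to the linear prediction: $\hat{p} = \frac{\exp(\hat\beta_{0} + \hat\beta_{1}\,gpa + \hat\beta_{2}\,tuce)}{1 + \exp(\hat\beta_{0} + \hat\beta_{1}\,gpa + \hat\beta_{2}\,tuce)}$.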
I could have directly used Stata's invlogit function:
generate phat3 = invlogit(`X')
All three methods yield the same results.
summarize phat1 phat2 phat3
With a continuous outcome and the regress command, the principle is the same. Here is an illustration.
replace gpa = . if _n > 20
regress gpa tuce psi
predict xbhat1, xb
gen xbhat2 = _b[_cons] + _b[tuce] * tuce + _b[psi] * psi
summarize xbhat1 xbhat2
An ordered logit model is more appropriate, as your dependent variable is a ranking: 7 is better than 4, for instance, so there is a clear order. This allows you to obtain a predicted probability for each category (bin). There are a few assumptions that you need to take into account:
One of the assumptions underlying ordinal logistic (and ordinal probit) regression is that the relationship between each pair of outcome groups is the same. In other words, ordinal logistic regression assumes that the coefficients that describe the relationship between, say, the lowest versus all higher categories of the response variable are the same as those that describe the relationship between the next lowest category and all higher categories, etc. This is called the proportional odds assumption or the parallel regression assumption.
Some code:
library("MASS")
## fit ordered logit model and store results 'm'
m <- polr(Y ~ X1 + X2 + X3, data = dat, Hess=TRUE)
## view a summary of the model
summary(m)
Keep in mind that you will need to transform your coefficients into odds ratios and then into probabilities to have a clear interpretation in terms of probabilities. In a straightforward (and simplistic) manner, you can compute these by:
$\exp(\beta_{i}) = \text{Odds Ratio}$
$\frac{\exp(\beta_{1})}{\sum_{i} \exp(\beta_{i})} = \text{Probability}$
(Don't want to be too technical)
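If you are working in Stata rather than R, a minimal sketch of the same idea (ologit is Stata's ordered logit command; the variable names are placeholders, and I assume the outcome has three categories):
* the or option reports odds ratios instead of coefficients
ologit y x1 x2 x3, or
* one new variable per outcome category: the predicted probability of each bin
predict p1 p2 p3, pr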
Best Answer
@IsabellaGhement provided one reasonable way to handle this in comments. If your substantive interest is the degree of racial segregation between high schools, I'd encourage you to read up on entropy indexes, which quantify the degree of segregation. If you're familiar with the Hirschman-Herfindahl index in economics, they essentially get at the same concept.
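As a rough sketch of what such indexes look like (notation mine, not from the answer): if $\pi_{m}$ is the share of racial group $m$ in a given school, that school's entropy is $E = -\sum_{m} \pi_{m} \ln \pi_{m}$ and its Herfindahl-type concentration is $\sum_{m} \pi_{m}^{2}$; entropy-based segregation indexes such as Theil's $H$ then compare each school's entropy with the entropy of the district as a whole.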