You have to combine, or rather stack, the historical data and the current data. In Stata, this can be done with the append command. Then you use the logit and regress commands to estimate the models. In a last step, you use predict to obtain the out-of-sample predictions for the outcome variable. In the case of a logit model, predict gives you the predicted probability of observing the outcome, given a set of values for the covariates.
Note that your understanding of the predict command is not quite correct. What you describe corresponds rather to the margins command.
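A minimal sketch of that workflow (the dataset and variable names here are hypothetical, not from your data):
* historical.dta contains y, x1, x2; current.dta contains x1, x2 with y missing
use historical, clear
append using current
logit y x1 x2
predict yhat, pr
* margins, at(x1=1 x2=10)   // what you described: the probability at chosen covariate values
For a continuous outcome you would estimate with regress instead of logit, and predict would then return the linear prediction xb.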
Example
I use a dataset from William Greene's Econometric Analysis textbook, available on the web:
clear
infile obs gpa tuce psi grade ///
using "http://people.stern.nyu.edu/wgreene/Text/Edition7/TableF14-1.txt"
I want to predict the probability of a grade increase. For the sake of illustration, I assume that the first 20 observations of the dataset are the historical data, and that the remaining 12 observations are the ones for which you want a prediction. Hence, I set the outcome for observations 21 to 32 to missing.
replace grade = . if _n > 20
I estimate a logistic regression model that predicts the probability of a grade increase (grade = 1) as a function of the gpa and tuce scores.
logit grade gpa tuce
Note that I have used the logit rather than the logistic command, because I want to reuse the coefficients in a moment. Now I can compute the predicted probabilities for all the observations.
predict phat1, p
To understand, or to be sure of, what predict does, I compute the following.
local X "_b[_cons] + _b[gpa]*gpa + _b[tuce]*tuce"
generate phat2 = exp(`X') /(1 + exp(`X'))
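In formula terms, this is just the inverse logit transform applied to the linear prediction: $\hat{p} = \frac{\exp(\hat\beta_{0} + \hat\beta_{1}\,gpa + \hat\beta_{2}\,tuce)}{1 + \exp(\hat\beta_{0} + \hat\beta_{1}\,gpa + \hat\beta_{2}\,tuce)}$.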
I could have directly used Stata's invlogit function:
generate phat3 = invlogit(`X')
All three methods yield the same results.
summarize phat1 phat2 phat3
With a continuous outcome and the regress command, the principle is the same. Here is an illustration.
replace gpa = . if _n > 20
regress gpa tuce psi
predict xbhat1, xb
gen xbhat2 = _b[_cons] + _b[tuce] * tuce + _b[psi] * psi
summarize xbhat1 xbhat2
An ordered logit model is more appropriate, as your dependent variable is a ranking: 7 is better than 4, for instance, so there is a clear order. This allows you to obtain a predicted probability for each category (bin). There are a few assumptions that you need to take into account:
One of the assumptions underlying ordinal logistic (and ordinal probit) regression is that the relationship between each pair of outcome groups is the same. In other words, ordinal logistic regression assumes that the coefficients that describe the relationship between, say, the lowest versus all higher categories of the response variable are the same as those that describe the relationship between the next lowest category and all higher categories, etc. This is called the proportional odds assumption or the parallel regression assumption.
Some code:
library("MASS")
## fit ordered logit model and store results 'm'
m <- polr(Y ~ X1 + X2 + X3, data = dat, Hess=TRUE)
## view a summary of the model
summary(m)
Keep in mind that you will need to transform your coefficients into odds ratios and then into probabilities to have a clear interpretation in terms of probabilities. In a straightforward (and simplistic) manner, you can compute these by:
$\exp(\beta_{i}) = \text{Odds Ratio}$
$\frac{\exp(\beta_{1})}{\sum_{i} \exp(\beta_{i})} = \text{Probability}$
(Don't want to be too technical)
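If you are working in Stata rather than R, a minimal sketch of the same idea (ologit is Stata's ordered logit command; the variable names are placeholders, and I assume the outcome has three categories):
* the or option reports odds ratios instead of coefficients
ologit y x1 x2 x3, or
* one new variable per outcome category: the predicted probability of each bin
predict p1 p2 p3, pr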
Best Answer
@IsabellaGhement provided one reasonable way to handle this in comments. If your substantive interest is the degree of racial segregation between high schools, I'd encourage you to read up on entropy indexes, which quantify the degree of segregation. If you're familiar with the Hirschman-Herfindahl index in economics, they essentially get at the same concept.
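As a rough sketch of what such indexes look like (notation mine, not from the answer): if $\pi_{m}$ is the share of racial group $m$ in a given school, that school's entropy is $E = -\sum_{m} \pi_{m} \ln \pi_{m}$ and its Herfindahl-type concentration is $\sum_{m} \pi_{m}^{2}$; entropy-based segregation indexes such as Theil's $H$ then compare each school's entropy with the entropy of the district as a whole.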