Treating categorical variables in logistic regression in SAS

I am scoring my dataset using a logistic model. To get the betas (coefficients) I used proc logistic with a model of the form y = x1 x2 x3.
To tell the model which variables are class variables, I am using the class statement. Now, let's say I have a categorical variable (named ppsc) with 4 categories: betas are generated for the first 3 categories (ppsc1, ppsc2, ppsc3), and I guess the fourth category is taken as the reference. To score, I am using proc score, so I have to generate 3 binary variables ppsc1, ppsc2 and ppsc3 such that

if ppsc = 1 then ppsc1 = 1; else ppsc1 = 0;

… and the same for ppsc2 and ppsc3.

Now my questions are:

  1. If category 4 is my most important one (number-wise) and the logistic procedure generates betas only for ppsc1, ppsc2 and ppsc3, what do I do? When ppsc is 4, will that variable's contribution to logit(p) be zero? How do I handle this? Would the same hold for other categorical variables too?

  2. I don't want to create the binary variables ppsc1, ppsc2 and ppsc3. Can't I specify at scoring time that these variables are categorical, with some class statement like the one in proc logistic?

I hope my problems are clear… I am, by the way, more concerned about the first problem, as it is fundamental to understanding how the score is generated.

Best Answer

You should just use the output statement in the logistic procedure; you'll then get your predicted probabilities, plus some other things. So you have:

proc logistic data=<your dataset>;
    class <your class variables>;
    model <your model>;
    output out=<output data set name> p=<predicted probability> xbeta=<linear predictor>;
run;

There are many other options; check the SAS documentation. So you don't need to score your observations separately: proc logistic does this for you.
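If the observations to be scored live in a different data set from the one used for fitting, the score statement of proc logistic is one way to avoid building the dummy variables by hand. The following is only a sketch under stated assumptions: the data sets train, newdata and scored, the covariate x1, the seed, and the simulation coefficients are all made up for illustration and do not come from the question.

```sas
/* Simulated stand-in for the training data; all coefficients are arbitrary */
data train;
    call streaminit(2024);
    do id = 1 to 500;
        ppsc = ceil(4 * rand('uniform'));      /* categories 1..4 */
        x1   = rand('normal');
        xb   = -0.5 + 0.6*x1 + 0.3*(ppsc = 1) - 0.4*(ppsc = 2) + 0.8*(ppsc = 3);
        y    = rand('bernoulli', 1 / (1 + exp(-xb)));
        output;
    end;
    drop xb;
run;

/* Hypothetical new observations: only the predictors are needed */
data newdata;
    input ppsc x1;
    datalines;
1 0.2
2 1.5
4 -1.0
;
run;

proc logistic data=train;
    /* param=ref requests reference (dummy) coding; ref='4' names the reference level */
    class ppsc (ref='4') / param=ref;
    model y(event='1') = x1 ppsc;
    /* ppsc stays a single column; the class coding is applied during scoring */
    score data=newdata out=scored;
run;
```

The scored data set should then contain the original columns of newdata plus predicted probabilities (columns with a P_ prefix), so no ppsc1, ppsc2, ppsc3 indicators have to be created.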

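In terms of dummy variable coding, it is easiest to write out the equations so you can see what's going on. With ppsc4 as the reference level under reference (dummy) coding, the linear predictor (ignoring other covariates for the example) is $\beta_{0}+\beta_{1}$ for ppsc1, $\beta_{0}+\beta_{2}$ for ppsc2, and $\beta_{0}+\beta_{3}$ for ppsc3. For ppsc4 it is just $\beta_{0}$. Hence the intercept is the effect due to ppsc4, and each of the other betas is a comparison (adjustment) to ppsc4. Note that proc logistic uses effect coding for class variables by default, in which the last level is coded -1 rather than 0; specify param=ref on the class statement to get the reference coding described here.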
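To make these equations concrete, here is a minimal sketch of computing the linear predictor by hand in a data step, assuming reference coding with ppsc4 as the reference. The values of b0 to b3 are hypothetical placeholders, not estimates from any real fit, and the data set names are made up.

```sas
/* One row per category, just to show the four cases */
data levels;
    input ppsc;
    datalines;
1
2
3
4
;
run;

data scored_manual;
    set levels;
    /* Hypothetical placeholder coefficients, NOT real estimates */
    b0 = -1.2; b1 = 0.4; b2 = -0.3; b3 = 0.9;
    /* A comparison such as (ppsc = 1) evaluates to 1 or 0 in SAS */
    xbeta = b0 + b1*(ppsc = 1) + b2*(ppsc = 2) + b3*(ppsc = 3);
    /* Inverse logit turns the linear predictor into a probability */
    p = 1 / (1 + exp(-xbeta));
run;

proc print data=scored_manual;
    var ppsc xbeta p;
run;
```

For ppsc = 4 all three indicators are 0, so xbeta reduces to b0: the reference category is not dropped from the score, it is carried by the intercept.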

Now suppose we change the reference group to ppsc2. Then we will have a new intercept $\beta_{0}^{(1)}=\beta_{0}+\beta_{2}$, and the linear predictor for ppsc1 becomes $\beta_{0}^{(1)}+\beta_{1}^{(1)}=\beta_{0}+\beta_{1}$. Solving gives $\beta_{1}^{(1)}=\beta_{1}-\beta_{2}$, and similarly for the other effects. Because MLEs are invariant under reparameterization, your estimates will satisfy these equations.
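The same argument gives the remaining coefficients: $\beta_{3}^{(1)}=\beta_{3}-\beta_{2}$, and the old reference level ppsc4 now gets its own coefficient $\beta_{4}^{(1)}=-\beta_{2}$, since $\beta_{0}^{(1)}+\beta_{4}^{(1)}=\beta_{0}$. One way to check this numerically is to fit the same model twice with different reference levels. The sketch below assumes reference coding (param=ref) and uses a simulated data set whose name, seed and coefficients are arbitrary choices for illustration.

```sas
/* Simulated data; the generating coefficients are arbitrary */
data train;
    call streaminit(1);
    do id = 1 to 400;
        ppsc = ceil(4 * rand('uniform'));      /* categories 1..4 */
        xb   = -0.5 + 0.3*(ppsc = 1) - 0.4*(ppsc = 2) + 0.8*(ppsc = 3);
        y    = rand('bernoulli', 1 / (1 + exp(-xb)));
        output;
    end;
    drop xb;
run;

/* Fit 1: ppsc = 4 is the reference level */
proc logistic data=train;
    class ppsc (ref='4') / param=ref;
    model y(event='1') = ppsc;
run;

/* Fit 2: ppsc = 2 is the reference level */
proc logistic data=train;
    class ppsc (ref='2') / param=ref;
    model y(event='1') = ppsc;
run;
```

Up to the numerical tolerance of the optimizer, the intercept of the second fit should equal the intercept plus the ppsc 2 estimate of the first, and the predicted probabilities for every category should be identical in both fits. Choosing the reference level therefore changes how the coefficients read, not the scores.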