Solved – Convert SAS Logistic Regression to R

econometricsleast squareslogisticrsas

Here's my issue of the day:

At the moment I'm teaching myself Econometrics and making use of logistic regression. I have some SAS code and I want to be sure I understand it well first before trying to convert it to R. (I don't have and I don't know SAS). In this code, I want to model the probability for one person to be an 'unemployed employee'. By this I mean "age" between 15 and 64, and "tact" = "jobless". I want to try to predict this outcome with the following variables: sex, age and idnat (nationality number). (Other things being equal).

SAS code :

/* Unemployment rate : number of unemployment amongst the workforce */
proc logistic data=census;
class sex(ref="Man") age idnat(ref="spanish") / param=glm;
class tact (ref=first);
model tact = sex age idnat / link=logit;
where 15<=age<=64 and tact in ("Employee" "Jobless");
weight weight;
format age ageC. tact $activity. idnat $nat_dom. inat $nationalty. sex $M_W.;

lsmeans sex / obsmargins ilink;
lsmeans idnat / obsmargins ilink;
lsmeans age / obsmargins ilink;
run;

This is a sample of what the database should looks like :

      idnat     sex     age  tact      
 [1,] "english" "Woman" "42" "Employee"
 [2,] "french"  "Woman" "31" "Jobless" 
 [3,] "spanish" "Woman" "19" "Employee"
 [4,] "english" "Man"   "45" "Jobless" 
 [5,] "english" "Man"   "34" "Employee"
 [6,] "spanish" "Woman" "25" "Employee"
 [7,] "spanish" "Man"   "39" "Jobless" 
 [8,] "spanish" "Woman" "44" "Jobless" 
 [9,] "spanish" "Man"   "29" "Employee"
[10,] "spanish" "Man"   "62" "Retired" 
[11,] "spanish" "Man"   "64" "Retired" 
[12,] "english" "Woman" "53" "Jobless" 
[13,] "english" "Man"   "43" "Jobless" 
[14,] "french"  "Man"   "61" "Retired" 
[15,] "french"  "Man"   "50" "Employee"

This is the kind of result I wish to get :

Variable    Modality    Value   ChiSq   Indicator
Sex         Women       56.6%   0.00001 -8.9%
            Men         65.5%       
Nationality 
            1:Spanish   62.6%       
            2:French    51.2%   0.00001 -11.4%
            3:English   48.0%   0.00001 -14.6%
Age 
            <25yo       33.1%   0.00001 -44.9%
        Ref:26<x<54yo   78.0%       
            55yo=<      48.7%   0.00001 -29.3%

Indicator is P(category)-P(ref)
(I interpret the above as follows: other things being equal, women have -8.9% chance of being employed vs men and those aged less than 25 have a -44.9% chance of being employed than those aged between 26 and 54).

So if I understand well, the best approach would be to use a binary logistic regression (link=logit). This uses references "male vs female"(sex), "employee vs jobless"(from 'tact' variable)… I presume 'tact' is automatically converted to a binary (0-1) variable by SAS.

Here is my 1st attempt in R. I haven't check it yet (need my own PC) :

### before using glm function 
### change all predictors to factors and relevel reference
recens$sex <- relevel(factor(recens$sex), ref = "Man")
recens$idnat <- relevel(factor(recens$idnat), ref = "spanish")  
recens$tact <- relevel(factor(recens$tact), ref = "Employee")
recens$ageC <- relevel(factor(recens$ageC), ref = "Ref : De 26 a 54 ans")

### Calculations of the probabilities with function glm, 
### formatted variables, and conditions with subset restriction to "from 15yo to 64"
### and "employee" and "jobless" only.
glm1 <- glm(activite ~ sex + ageC + idnat, data=recens, weights = weight, 
            subset= recens$age[(15<= recens$age | recens$age <= 64)] 
        & recens$tact %in% c("Employee","Jobless"), 
            family=quasibinomial("logit"))

My questions :

For the moment, it seems there are many functions to carry out a logistic regression in R like glm which seems to fit.

However after visiting many forums it seems a lot of people recommend not trying to exactly reproduce SAS PROC LOGISTIC, particularly the function LSMEANS. Dr Franck Harrel, (author of package:rms) for one.

That said, I guess my big issue is LSMEANS and its options Obsmargins and ILINK. Even after reading over its description repeatedly I can hardly understand how it works.

So far, what I understand of Obsmargin is that it respects the structure of the total population of the database (i.e. calculations are done with proportions of the total population). ILINK appears to be used to obtain the predicted probability value (jobless rate, employment rate) for each of the predictors (e.g. female then male) rather than the value found by the (exponential) model?

In short, how could this be done through R, with lrm from rms or lsmeans?

I'm really lost in all of this. If someone could explain it to me better and tell me if I'm on the right track it would make my day.

Best Answer

The least squares means are, in a sense, weighted averages. You can obtain similar results by using predict.glm with a type="response" option and then summarising over the variable of interest. E.g. predict the umeployment rates for all the data, and then summarise for all Males / Females and calculate the difference.

SAS's ilink option is doing the same thing by inverting the link (logit) function and turning the estimates from log odds back into probabilities - converting the estimates to the scale of the response variable (i.e. probability).

Be careful reporting differences in probability derived this way. As an extreme example, if there was a sub-population with a Male unemployment rate of zero, would you expect the unemployment rate for Females in that same sub-group to to be zero, or -8.9 percent?

Related Question