Solved – Convert SAS Logistic Regression to R

Tags: econometrics, least squares, logistic, r, sas

Here's my issue of the day:

At the moment I'm teaching myself econometrics and making use of logistic regression. I have some SAS code and I want to be sure I understand it well before trying to convert it to R (I don't have SAS and I don't know it). In this code, I want to model the probability that a person is an 'unemployed employee'. By this I mean "age" between 15 and 64 and "tact" = "Jobless". I want to predict this outcome from the following variables: sex, age and idnat (nationality number), other things being equal.

SAS code :

/* Unemployment rate: number of unemployed among the workforce */
proc logistic data=census;
class sex(ref="Man") age idnat(ref="spanish") / param=glm;
class tact (ref=first);
model tact = sex age idnat / link=logit;
where 15<=age<=64 and tact in ("Employee" "Jobless");
weight weight;
format age ageC. tact $activity. idnat $nat_dom. inat $nationalty. sex $M_W.;

lsmeans sex / obsmargins ilink;
lsmeans idnat / obsmargins ilink;
lsmeans age / obsmargins ilink;
run;

This is a sample of what the database looks like:

      idnat     sex     age  tact      
 [1,] "english" "Woman" "42" "Employee"
 [2,] "french"  "Woman" "31" "Jobless" 
 [3,] "spanish" "Woman" "19" "Employee"
 [4,] "english" "Man"   "45" "Jobless" 
 [5,] "english" "Man"   "34" "Employee"
 [6,] "spanish" "Woman" "25" "Employee"
 [7,] "spanish" "Man"   "39" "Jobless" 
 [8,] "spanish" "Woman" "44" "Jobless" 
 [9,] "spanish" "Man"   "29" "Employee"
[10,] "spanish" "Man"   "62" "Retired" 
[11,] "spanish" "Man"   "64" "Retired" 
[12,] "english" "Woman" "53" "Jobless" 
[13,] "english" "Man"   "43" "Jobless" 
[14,] "french"  "Man"   "61" "Retired" 
[15,] "french"  "Man"   "50" "Employee"

This is the kind of result I wish to get :

Variable     Modality        Value   ChiSq    Indicator
Sex          Women           56.6%   0.00001  -8.9%
             Men             65.5%
Nationality  1:Spanish       62.6%
             2:French        51.2%   0.00001  -11.4%
             3:English       48.0%   0.00001  -14.6%
Age          <25yo           33.1%   0.00001  -44.9%
             Ref:26<x<54yo   78.0%
             55yo=<          48.7%   0.00001  -29.3%

Indicator is P(category) - P(ref).
(I interpret the above as follows: other things being equal, women are 8.9 percentage points less likely to be employed than men, and those aged under 25 are 44.9 percentage points less likely to be employed than those aged between 26 and 54.)

So if I understand correctly, the best approach would be to use a binary logistic regression (link=logit). This uses the reference levels "male vs female" (sex) and "employee vs jobless" (from the 'tact' variable). I presume 'tact' is automatically converted to a binary (0-1) variable by SAS.

Here is my first attempt in R. I haven't checked it yet (I need my own PC):

### Before using the glm function,
### convert all predictors to factors and set the reference levels
recens$sex <- relevel(factor(recens$sex), ref = "Man")
recens$idnat <- relevel(factor(recens$idnat), ref = "spanish")
recens$tact <- relevel(factor(recens$tact), ref = "Employee")
recens$ageC <- relevel(factor(recens$ageC), ref = "Ref : De 26 a 54 ans")

### Fit the model with glm on the formatted variables, using subset to keep
### only ages 15 to 64 and only "Employee" and "Jobless".
### subset is evaluated inside recens; this assumes age is stored as a number.
### With a factor response, glm models the probability of the non-reference
### level, here "Jobless".
glm1 <- glm(tact ~ sex + ageC + idnat, data = recens, weights = weight,
            subset = age >= 15 & age <= 64 &
                     tact %in% c("Employee", "Jobless"),
            family = quasibinomial("logit"))

My questions :

It seems there are many functions for carrying out a logistic regression in R; glm looks like a good fit.

However, after visiting many forums, it seems a lot of people recommend not trying to reproduce SAS PROC LOGISTIC exactly, particularly the LSMEANS statement. Dr. Frank Harrell (author of the rms package) is one of them.

That said, I guess my big issue is LSMEANS and its options OBSMARGINS and ILINK. Even after reading their descriptions repeatedly, I can hardly understand how they work.

So far, what I understand of OBSMARGINS is that it respects the structure of the total population of the database (i.e. calculations are done with the proportions of the total population). ILINK appears to be used to obtain the predicted probability (jobless rate, employment rate) for each level of a predictor (e.g. female, then male) rather than the value on the scale of the (exponential) model?

In short, how could this be done through R, with lrm from rms or lsmeans?

I'm really lost in all of this. If someone could explain it to me better and tell me if I'm on the right track it would make my day.

Best Answer

The least squares means are, in a sense, weighted averages. You can obtain similar results by using predict.glm with the type="response" option and then summarising over the variable of interest. E.g. predict the unemployment rates for all the data, then summarise for all males / females and calculate the difference.
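A minimal sketch of this predict-then-summarise idea, on simulated data. The data frame `census_sim`, its column names, the age labels and the coefficients used to generate it are all assumptions made for illustration; they are not the asker's census file. The second half averages predictions with everyone set to each sex in turn, which keeps the other covariates at their observed distribution. That is in the spirit of OBSMARGINS, but it is not guaranteed to reproduce the SAS output.

```r
set.seed(1)
n <- 2000

# Simulated stand-in for the census data (hypothetical names and levels)
census_sim <- data.frame(
  sex   = factor(sample(c("Man", "Woman"), n, replace = TRUE),
                 levels = c("Man", "Woman")),
  idnat = factor(sample(c("spanish", "french", "english"), n, replace = TRUE),
                 levels = c("spanish", "french", "english")),
  ageC  = factor(sample(c("15-25", "26-54", "55-64"), n, replace = TRUE),
                 levels = c("26-54", "15-25", "55-64"))
)
eta <- -1 + 0.4 * (census_sim$sex == "Woman") +
  0.5 * (census_sim$idnat != "spanish") +
  1.0 * (census_sim$ageC != "26-54")
census_sim$jobless <- rbinom(n, 1, plogis(eta))

fit <- glm(jobless ~ sex + ageC + idnat, data = census_sim,
           family = binomial("logit"))

# Predict for every row, then average within each observed sex.
# Because sex enters as a main effect in an unweighted logit model,
# these averages reproduce the raw jobless rates by sex.
census_sim$p_hat <- predict(fit, type = "response")
rate_by_sex <- tapply(census_sim$p_hat, census_sim$sex, mean)

# "Other things being equal": set everyone to one sex, predict, and average
avg_pred <- function(level) {
  nd <- census_sim
  nd$sex <- factor(level, levels = levels(census_sim$sex))
  mean(predict(fit, newdata = nd, type = "response"))
}
adj_rate <- sapply(levels(census_sim$sex), avg_pred)
diff_adj <- adj_rate["Woman"] - adj_rate["Man"]

print(rate_by_sex)
print(adj_rate)
print(diff_adj)
```

The difference `diff_adj` plays the role of the "Indicator" column in the question, under the assumptions above.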

SAS's ilink option is doing the same thing by inverting the link (logit) function and turning the estimates from log odds back into probabilities - converting the estimates to the scale of the response variable (i.e. probability).
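As a small illustration of what inverting the logit link means, the sketch below uses base R only. The simulated data frame `d` and the helper `inv_logit` are hypothetical names introduced here. It checks that applying the inverse logit to predictions on the link scale gives the same numbers as asking `predict` for the response scale directly.

```r
# Inverse logit written out by hand; plogis() is the built-in equivalent
inv_logit <- function(eta) 1 / (1 + exp(-eta))

eta <- c(-2, 0, 0.5, 3)
p <- inv_logit(eta)
print(all.equal(p, plogis(eta)))   # same function
print(all.equal(qlogis(p), eta))   # qlogis() is the logit, so this round-trips

# The same relation connects the two prediction scales of a fitted glm
set.seed(2)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.3 + 0.8 * d$x))
fit <- glm(y ~ x, data = d, family = binomial)

link_scale <- predict(fit, type = "link")       # log odds
resp_scale <- predict(fit, type = "response")   # probabilities
same <- isTRUE(all.equal(plogis(link_scale), resp_scale))
print(same)
```

Note that inverting the link of an averaged linear predictor is generally not the same as averaging the individual probabilities, so this route and the predict-then-summarise route can give slightly different numbers.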

Be careful reporting differences in probability derived this way. As an extreme example, if there was a sub-population with a male unemployment rate of zero, would you expect the unemployment rate for females in that same sub-group to be zero, or -8.9 percent?
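To make the caution concrete, here is a short sketch showing that one fixed shift on the log-odds scale translates into very different probability differences depending on the baseline. The shift `b_sex` and the three baseline values are made-up numbers for illustration, not estimates from the question.

```r
b_sex <- -0.4                                   # hypothetical log-odds effect
baseline_eta <- c(low = -6, mid = 0, high = 6)  # hypothetical baseline log odds

p_ref   <- plogis(baseline_eta)           # probability in the reference group
p_other <- plogis(baseline_eta + b_sex)   # probability in the comparison group
prob_diff <- p_other - p_ref

# Same coefficient, different gaps in percentage points
print(round(100 * cbind(p_ref, p_other, prob_diff), 2))
```

The gap is largest near a baseline probability of 0.5 and shrinks towards zero at the extremes, which is why a single averaged difference should not be read as applying to every sub-group.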

Related Solutions

Solved – Treating categorical variables in logistic regression in SAS

You should just use the output statement in the logistic procedure, then you'll get your predicted probabilities, plus some other things. So you have:

Proc logistic Data=<your dataset>;
class <your class variables>;
model <your model>;
Output out=<output data set name> p=<predicted probability> xbeta=<linear predictor>;
Run;

There are many other options; check the SAS documentation. So you don't need to score your observations separately, because proc logistic does this for you.

In terms of dummy variable coding, it is easiest to write out the equations, so you can see what's going on. For ppsc1 we have (ignoring other covariates for the example) $\beta_{0}+\beta_{1}$, for ppsc2 we have $\beta_{0}+\beta_{2}$, for ppsc3 we have $\beta_{0}+\beta_{3}$. But for ppsc4 we have $\beta_{0}$ - hence the intercept is the effect due to ppsc4, and each of the other betas is a comparison (adjustment) to ppsc4.

Now suppose we change the reference group to be ppsc2. Then we will have a new intercept $\beta_{0}^{(1)}=\beta_{0}+\beta_{2}$, and the effect for ppsc1 will be changed to $\beta_{0}^{(1)}+\beta_{1}^{(1)}=\beta_{0}+\beta_{1}$. Using this we have $\beta_{1}^{(1)}=\beta_{1}-\beta_{2}$, and similarly for the other effects. Because of invariance of MLEs, your estimates will satisfy these equations.
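The reparameterisation above can be checked numerically. The following is a sketch in R on simulated data; the data frame `d`, the level names and the true effects are assumptions chosen to mirror the ppsc1 to ppsc4 example, and `relevel` is used to switch the reference group.

```r
set.seed(3)
n <- 1000
d <- data.frame(ppsc = factor(sample(paste0("ppsc", 1:4), n, replace = TRUE)))
true_eta <- c(ppsc1 = 0.5, ppsc2 = -0.3, ppsc3 = 1.0, ppsc4 = 0)
d$y <- rbinom(n, 1, plogis(true_eta[as.character(d$ppsc)]))

# Same factor, two different reference levels
d$ppsc_ref4 <- relevel(d$ppsc, ref = "ppsc4")
d$ppsc_ref2 <- relevel(d$ppsc, ref = "ppsc2")

b  <- coef(glm(y ~ ppsc_ref4, data = d, family = binomial))  # reference ppsc4
b1 <- coef(glm(y ~ ppsc_ref2, data = d, family = binomial))  # reference ppsc2

# New intercept equals old intercept plus old beta_2
new_int <- unname(b1["(Intercept)"])
old_int_plus_b2 <- unname(b["(Intercept)"] + b["ppsc_ref4ppsc2"])

# New ppsc1 effect equals old beta_1 minus old beta_2
new_b1 <- unname(b1["ppsc_ref2ppsc1"])
old_b1_minus_b2 <- unname(b["ppsc_ref4ppsc1"] - b["ppsc_ref4ppsc2"])

print(c(new_int, old_int_plus_b2))
print(c(new_b1, old_b1_minus_b2))
```

Each printed pair should agree up to the numerical tolerance of the fitting algorithm, so changing the reference level changes the labelling of the coefficients but not the fitted probabilities.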

Solved – Convert SAS NLMIXED code for zero-inflated gamma regression to R

Having spent some time on this code, it appears to me as though it basically:

1) Does a logistic regression with right hand side b0_f + b1_f*x1 and y > 0 as a target variable,

2) For those observations for which y > 0, performs a regression with right hand side b0_h + b1_h*x1, a Gamma likelihood and link=log,

3) Also estimates the shape parameter of the Gamma distribution.

It maximizes the likelihood jointly, which is nice, because you only have to make the one function call. However, the likelihood separates anyway, so you don't get improved parameter estimates as a result.
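To see why the likelihood separates, it helps to write it out. The notation here is an assumption introduced for illustration: $z_i = 1$ if $y_i > 0$ and $0$ otherwise, $\pi_i = P(y_i > 0)$ with $\operatorname{logit}(\pi_i) = b_{0f} + b_{1f} x_i$, and $g(\cdot\,; \mu_i, \alpha)$ is a Gamma density with mean $\mu_i = \exp(b_{0h} + b_{1h} x_i)$ and shape $\alpha$. The original SAS code may parameterise the zero probability the other way round, which only flips the sign of the logistic coefficients. The log-likelihood is then

$$
\ell(b_f, b_h, \alpha) = \underbrace{\sum_{i} \big[ z_i \log \pi_i + (1 - z_i) \log (1 - \pi_i) \big]}_{\text{depends only on } b_f} + \underbrace{\sum_{i:\, y_i > 0} \log g(y_i; \mu_i, \alpha)}_{\text{depends only on } b_h,\ \alpha}
$$

Since no parameter appears in both sums, maximising the total is the same as maximising each sum on its own, which is what two separate glm calls do.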

Here is some R code which makes use of the glm function to save programming effort. This may not be what you'd like, as it obscures the algorithm itself. The code certainly isn't as clean as it could / should be, either.

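McLerran <- function(y, x)
{
  # Indicator of a positive response, used as the logistic target
  z <- y > 0
  # Keep only the positive responses (and matching x) for the Gamma part
  y.gt.0 <- y[y>0]
  x.gt.0 <- x[y>0]

  # Part 1: logistic regression for P(y > 0)
  m1 <- glm(z~x, family=binomial)
  # Part 2: Gamma regression with log link on the positive responses
  m2 <- glm(y.gt.0~x.gt.0, family=Gamma(link=log))

  list("p.ygt0"=m1,"ygt0"=m2)
}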

# Sample data
x <- runif(100)
y <- rgamma(100, 3, 1)      # Not a function of x (coef. of x = 0)
b <- rbinom(100, 1, 0.5*x)  # p(y==0) is a function of x
y[b==1] <- 0

foo <- McLerran(y,x)
summary(foo$ygt0)

Call:
glm(formula = y.gt.0 ~ x.gt.0, family = Gamma(link = log))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.08888  -0.44446  -0.06589   0.28111   1.31066  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.2033     0.1377   8.737 1.44e-12 ***
x.gt.0       -0.2440     0.2352  -1.037    0.303    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for Gamma family taken to be 0.3448334)

    Null deviance: 26.675  on 66  degrees of freedom
Residual deviance: 26.280  on 65  degrees of freedom
AIC: 256.42

Number of Fisher Scoring iterations: 6

The shape parameter for the Gamma distribution is equal to 1 / the dispersion parameter for the Gamma family. Coefficients and other quantities which you might like to access programmatically can be accessed on the individual elements of the returned list:

> coefficients(foo$p.ygt0)
(Intercept)           x 
   2.140239   -2.393388
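The shape parameter mentioned above can be pulled out in the same way. This is a small self-contained sketch on freshly simulated data; the model object `m` is a hypothetical stand-in for the Gamma part of the fit (the `ygt0` element above).

```r
set.seed(4)
x <- runif(200)
y <- rgamma(200, shape = 3, rate = 1)   # true shape is 3, no dependence on x

m <- glm(y ~ x, family = Gamma(link = "log"))

# summary.glm reports a moment-based (Pearson) estimate of the dispersion
disp <- summary(m)$dispersion
shape_hat <- 1 / disp

print(disp)
print(shape_hat)
```

Because the dispersion reported by `summary` is a moment estimate, `shape_hat` is not the maximum likelihood estimate of the shape; `gamma.shape` in the MASS package is one option if the MLE is needed.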

Prediction can be done using the output of the routine. Here's some more R code that shows how to generate expected values and some other information:

# Predict expected value
predict.McLerren <- function(model, x.new)
{
  x <- as.data.frame(x.new)
  colnames(x) <- "x"
  x$x.gt.0 <- x$x

  pred.p.ygt0 <- predict(model$p.ygt0, newdata=x, type="response", se.fit=TRUE)
  pred.ygt0 <- predict(model$ygt0, newdata=x, type="response", se.fit=TRUE)  

  p0 <- 1 - pred.p.ygt0$fit
  ev <- (1-p0) * pred.ygt0$fit

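  se.p0 <- pred.p.ygt0$se.fit
  se.ev <- pred.ygt0$se.fit

  # Standard error of a product of two independent estimates:
  # (1-p0), with s.e. se.p0, times the Gamma mean, with s.e. se.ev
  se.fit <- sqrt(((1-p0)*se.ev)^2 + (pred.ygt0$fit*se.p0)^2 + (se.p0*se.ev)^2)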

  list("fit"=ev, "p0"=p0, "se.fit" = se.fit,
       "pred.p.ygt0"=pred.p.ygt0, "pred.ygt0"=pred.ygt0)
}

And a sample run:

> x.new <- seq(0.05,0.95,length=5)
> 
> foo.pred <- predict.McLerren(foo, x.new)
> foo.pred$fit
       1        2        3        4        5 
2.408946 2.333231 2.201889 2.009979 1.763201 
> foo.pred$se.fit
        1         2         3         4         5 
0.3409576 0.2378386 0.1753987 0.2022401 0.2785045 
> foo.pred$p0
        1         2         3         4         5 
0.1205351 0.1733806 0.2429933 0.3294175 0.4291541

Now for coefficient extraction and the contrasts:

coef.McLerren <- function(model)
{
  temp1 <- coefficients(model$p.ygt0)
  temp2 <- coefficients(model$ygt0)
  names(temp1) <- NULL
  names(temp2) <- NULL
  retval <- c(temp1, temp2)
  names(retval) <- c("b0.f","b1.f","b0.h","b1.h")
  retval
}

contrast.McLerren <- function(b0_f, b1_f, b2_f, b0_h, b1_h, b2_h)
{
  (1-(1 / (1 + exp(-b0_f -b1_f))))*(exp(b0_h + b1_h)) - (1-(1 / (1 + exp(-b0_f -b2_f))))*(exp(b0_h + b2_h))
}


> coef.McLerren(foo)
      b0.f       b1.f       b0.h       b1.h 
 2.0819321 -1.8911883  1.0009568  0.1334845
