Logit – Logit with Ordinal Independent Variables

logisticlogitordinal-data

In a logit model, is there a smarter way to determine the effect of an independent ordinal variable than to use dummy variables for each level?

Best Answer

To add to @dmk38's response, "any set of scores gives a valid test, provided they are constructed without consulting the results of the experiment. If the set of scores is poor, in that it badly distorts a numerical scale that really does underlie the ordered classification, the test will not be sensitive. The scores should therefore embody the best insight available about the way in which the classification was constructed and used." (Cochran, 1954, cited by Agresti, 2002, pp. 88-89). In other words, treating an ordered factor as a numerically scored variable is merely a modelling issue. Provided it makes sense, this will only impact the way you interpret the result, and there is no definitive rule of thumb on how to choose the best representation for an ordinal variable.

Consider the following example on maternal Alcohol consumption and presence or absence of congenital malformation (Agresti, Categorical Data Analysis, Table 3.7 p.89):

            0    <1 1-2 3-5 6+
Absent  17066 14464 788 126 37
Present    48    38   5   1  1

In this particular case, we can model the outcome using logistic regression or simple association table. Let's do it in R:

tab3.7 <- matrix(c(17066,48,14464,38,788,5,126,1,37,1), nr=2,
                 dimnames=list(c("Absent","Present"),
                               c("0","<1","1-2","3-5","6+")))
library(vcd)
assocstats(tab3.7)

Usual $\chi^2$ (12.08, p=0.016751) or LR (6.20, p=0.184562) statistic (with 4 df) do not account for the ordered levels in Alcohol consumption.

Treating both variables as ordinal with equally spaced scores (this has no impact for binary variables, like malformation, and we choose the baseline as 0=absent), we could test for a linear by linear association. Let's first construct an exploded version of this contingency Table:

library(reshape)
tab3.7.df <- untable(data.frame(malform=gl(2,1,10,labels=0:1), 
                                alcohol=gl(5,2,10,labels=colnames(tab3.7))), 
                     c(tab3.7))
# xtabs(~malform+alcohol, tab3.7.df) # check

Then we can test for a linear association using

library(coin)
#lbl_test(as.table(tab3.7))
lbl_test(malform ~ alcohol, data=tab3.7.df)

which yields $\chi^2(1)=1.83$ with $p=0.1764$. Note that this statistic is simply the correlation between the two series of scores (that Agresti called $M^2=(n-1)r^2$), which is readily computed as

cor(sapply(tab3.7.df, as.numeric))[1,2]^2*(32574-1)

As can be seen, there is not much evidence of a clear association between the two variables. As done by Agresti, if we choose to recode Alcohol levels as {0,0.5,1.5,4,7}, that is using mid-range values for an hypothesized continuous scale with the last score being somewhat purely arbitrary, then we would conclude to a larger effect of maternal Alcohol consumption on the development of congenital malformation:

lbl_test(malform ~ alcohol, data=tab3.7.df,         
         scores=list(alcohol=c(0,0.5,1.5,4,7)))

yields a test statistic of 6.57 with an associated p-value of 0.01037.

There are alternative coding schemes, including midranks (in which case, we fall back to Spearman $\rho$ instead of Pearson $r$) that is discussed by Agresti, but I hope you catch the general idea here: It is best to select scores that actually reflect a reasonable measures of the distance between adjacent categories of your ordinal variable, and equal spacing is often a good compromise (in the absence of theoretical justification).

Using the GLM approach, we would proceed as follows. But first check how Alcohol is encoded in R:

class(tab3.7.df$alcohol)

It is a simple unordered factor ("factor"), hence a nominal predictor. Now, here are three models were we consider Alcohol as a nominal, ordinal or continuous predictor.

summary(mod1 <- glm(malform ~ alcohol, data=tab3.7.df, 
                    family=binomial))
summary(mod2 <- glm(malform ~ ordered(alcohol), data=tab3.7.df, 
                    family=binomial))
summary(mod3 <- glm(malform ~ as.numeric(alcohol), data=tab3.7.df, 
                    family=binomial))

The last case implicitly assumes an equal-interval scale, and the $\hat\beta$ is interpreted as @dmk38 did: it reflects the effect of a one-unit increase in Alcohol on the outcome through the logit link, that is the increase in probability of observing a malformation (compared to no malformation, i.e. the odds-ratio) is $\exp(\hat\theta)=\exp(0.228)=1.256$. The Wald test is not significant at the usual 5% level. In this case, the design matrix only includes 2 columns: the first is a constant column of 1's for the intercept, the second is the numerical value (1 to 5) for the predictor, as in a simple linear regression. In sum, this model tests for a linear effect of Alcohol on the outcome (on the logit scale).

However, in the two other cases (mod1 and mod2), we get different output because the design matrix used to model the predictor differs, as can be checked by using:

model.matrix(mod1)
model.matrix(mod2)

We can see that the associated design matrix for mod1 includes dummy variables for the $k-1$ levels of Alcohol (0 is always the baseline) after the intercept term in the first column, whereas in the case of mod2 we have four columns of contrast-coded effects (after the column of 1's for the intercept). The coefficient for the category "3-5" is estimated at 1.03736 under mod1, but 0.01633 under mod2. Note that AIC and other likelihood-based measures remain, however, identical between these two models.

You can try assigning new scores to Alcohol and see how it will impact the predicted probability of a malformation.