Solved – How to analyze categorical variables with multiple levels

categorical-data, multiple-comparisons

I have 14 different habitat classes and two activity states (so two variables: habitat and activity). For activity state A, I have a data count of over 300, but for B, I only have around 30 (so unequal sample sizes). I really want to see if one or more habitats are used significantly more than others during each activity state, but I also want to compare habitat counts between activity states, to see if they differ significantly. I have two other variables, site location and gender, and I would also like to see if the habitat count varies significantly between these two. It would also be nice to combine some of the variables, e.g., does habitat count vary between genders across sites, if this makes sense. I am currently using JMP8 as I have very limited experience with R. I have been using chi-squared and contingency analysis (although a number of my cells have quite low counts, less than five, which chi-squared struggles with).
What would be the best way to approach this statistically? How would I need to format the data (it is currently in text rather than count form, as there are so many classes in the habitat variable that it seems otherwise difficult to manage)? Any help would be hugely appreciated!!

Best Answer

Without wanting to sound critical, this is a case where I think most of the work needed is clarifying which questions you want to answer, and designing your investigation accordingly. As best I can tell, the number of cases falling into each habitat class is your response variable, and you wonder if these counts vary under differing conditions. You seem primarily interested in whether these counts differ by site location or gender or possibly some combination of these covariates.

Let me state a few things. First, because you know the total, it is best to think of your data as proportions rather than counts. Second, although I don't think you want to use a chi-squared analysis, the potential threat to its validity is expected counts under independence that are less than 5, not observed counts. Finally, note that if any of your covariates matter in terms of the proportion of cases in the differing habitats, that means that the habitat classes are not equiprobable; I would ignore that question unless there are clearly no relationships amongst the variables.
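To make the point about expected versus observed counts concrete, here is a minimal sketch in Python. It assumes the data are kept in "text" form, one record per observation, and tallies them into counts. The habitat names and the numbers are made up for illustration and are not the asker's data.

```python
from collections import Counter

# Hypothetical long-format records: one (habitat, activity) pair per observation.
records = (
    [("forest", "A")] * 40 + [("forest", "B")] * 3
    + [("grassland", "A")] * 56 + [("grassland", "B")] * 4
    + [("wetland", "A")] * 10 + [("wetland", "B")] * 3
)

def expected_counts(records):
    """Return {(row, col): expected count} assuming rows and columns are independent."""
    n = len(records)
    row_totals = Counter(r for r, _ in records)
    col_totals = Counter(c for _, c in records)
    return {
        (r, c): row_totals[r] * col_totals[c] / n
        for r in row_totals
        for c in col_totals
    }

observed = Counter(records)           # the contingency table, as a dict of cell counts
expected = expected_counts(records)   # row total * column total / grand total

# Cells whose EXPECTED count is below 5 are the ones that matter for chi-squared.
low_cells = sorted(cell for cell, e in expected.items() if e < 5)

for cell in sorted(expected):
    print(cell, "observed:", observed[cell], "expected:", round(expected[cell], 2))
print("cells with expected count < 5:", low_cells)
```

In this toy table the grassland/B cell has an observed count of 4 but an expected count above 5, so it would not be flagged, whereas the forest/B and wetland/B cells would be.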

What you want to look into is multinomial logistic regression. This is just a generalization of binary logistic regression (with which I assume you are familiar) to situations where there are more than two response categories. You might find some of the related threads on CV interesting; searching the site for multinomial logistic regression will turn them up. I don't know JMP8, but the UCLA statistics consulting site has a user-friendly introduction to multinomial logistic regression in R.
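To give a feel for what such a model does, here is a from-scratch sketch in plain Python. It is not how a statistics package would fit the model (it uses simple gradient ascent and reports no standard errors or tests), and the three habitats, the counts, and the single predictor (activity state) are all hypothetical. The first habitat is treated as the reference category, as in binary logistic regression.

```python
import math

# Hypothetical counts of habitat use per activity state (illustrative numbers only).
habitats = ["forest", "grassland", "wetland"]   # first entry is the reference class
counts = {
    "A": [40, 56, 10],
    "B": [3, 4, 3],
}
# Design rows: an intercept plus an indicator for activity state B.
design = {"A": [1.0, 0.0], "B": [1.0, 1.0]}

def predict(coefs, x):
    """Softmax probabilities; the reference class has all coefficients fixed at 0."""
    scores = [0.0] + [sum(b * xi for b, xi in zip(beta, x)) for beta in coefs]
    top = max(scores)                     # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fit(counts, design, n_classes, steps=20000, rate=0.5):
    """Maximize the multinomial log-likelihood by plain gradient ascent."""
    n_features = len(next(iter(design.values())))
    coefs = [[0.0] * n_features for _ in range(n_classes - 1)]
    n_total = sum(sum(c) for c in counts.values())
    for _ in range(steps):
        grads = [[0.0] * n_features for _ in coefs]
        for group, observed in counts.items():
            x = design[group]
            probs = predict(coefs, x)
            group_n = sum(observed)
            for k in range(1, n_classes):
                # Gradient contribution: (observed - expected count) times predictor.
                resid = observed[k] - group_n * probs[k]
                for j, xj in enumerate(x):
                    grads[k - 1][j] += resid * xj
        for k, grad in enumerate(grads):
            for j, g in enumerate(grad):
                coefs[k][j] += rate * g / n_total
    return coefs

coefs = fit(counts, design, len(habitats))
fitted = {group: predict(coefs, design[group]) for group in counts}

for group, probs in fitted.items():
    print(group, {h: round(p, 3) for h, p in zip(habitats, probs)})
```

With a single categorical predictor the model is saturated, so the fitted probabilities simply reproduce the observed habitat proportions within each activity state. The approach becomes useful once further covariates such as site and gender (and their interactions) are added as extra columns of the design rows.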
