Regression – How to Determine Number of Levels and Combine Categories in Logistic Regression

categorical data, logistic, regression

I have an independent variable, “State in the US”, with around 16 different levels (PA, NY, NJ, etc.). I have combined the states that appear infrequently into an "others" category, but I still have around 12 levels, which are roughly equally populated.

I have a very large data set (1 million observations) and no zero cell issues with the variables.

Questions:

Is there a limit on the number of levels a categorical variable can have when used in logistic regression? If so, why?
What is the best way to bin categorical variables like zip code or occupation code, apart from binning similar levels together?

Best Answer

You can add as many categories as you like, as long as you do not run into problems like perfect separation. However, each additional level adds another parameter to estimate, so as you add more levels you will typically lose statistical power. Adding levels is not free.
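To make those two points concrete, here is a minimal sketch in plain Python, assuming a binary outcome coded 0/1. The function names and the toy data are made up for illustration. It flags levels in which every observation has the same outcome (a sign of separation for that level's dummy) and counts the dummy parameters a categorical variable adds to the model.

```python
from collections import defaultdict

def separated_levels(levels, outcomes):
    """Return the levels whose observations all share one outcome (0 or 1)."""
    counts = defaultdict(lambda: [0, 0])  # level -> [n with y=0, n with y=1]
    for level, y in zip(levels, outcomes):
        counts[level][y] += 1
    return sorted(
        level for level, (n0, n1) in counts.items() if n0 == 0 or n1 == 0
    )

def n_dummy_parameters(levels):
    """With one reference category, k levels add k - 1 coefficients."""
    return len(set(levels)) - 1

# Toy data (hypothetical): every NJ observation has y = 1.
states = ["PA", "PA", "NY", "NY", "NJ", "NJ", "NJ"]
y = [0, 1, 1, 0, 1, 1, 1]

print(separated_levels(states, y))   # ['NJ']
print(n_dummy_parameters(states))    # 2
```

With a million observations and 12 well-populated levels, a check like this will usually come back empty, which matches the "no zero cell issues" described in the question.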

As to binning, that depends on the substance. Take occupation code: There are many class schemes like the EGP classes (Erikson, Goldthorpe, Portocarero 1979) or micro classes (Weeden and Grusky 2005). You could also transform occupational codes to a measure of occupational status like the ISEI (Ganzeboom, De Graaf and Treiman 1992), and add that linearly. There are long debates on which one is best, but in essence they just represent different theories and measure slightly different things. So, whichever is best depends on what your question is.
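Mechanically, both approaches come down to a lookup from the detailed code to either a class label or a numeric score. The sketch below assumes such a lookup table is available. All codes, class labels, and scores in it are hypothetical placeholders, not actual EGP or ISEI assignments, which you would take from the published conversion tables.

```python
# Hypothetical lookup tables: placeholder codes, labels, and scores only.
CLASS_SCHEME = {"1001": "class_A", "1002": "class_A", "2001": "class_B"}
STATUS_SCORE = {"1001": 70.0, "1002": 65.0, "2001": 40.0}

def recode_to_class(codes, scheme, fallback="unclassified"):
    """Map detailed codes to coarser classes (enters the model as dummies)."""
    return [scheme.get(code, fallback) for code in codes]

def recode_to_score(codes, scores):
    """Map detailed codes to a numeric status score (enters linearly).

    Codes missing from the table become None so they can be handled
    explicitly rather than silently imputed.
    """
    return [scores.get(code) for code in codes]

occupations = ["1001", "2001", "1002", "9999"]
print(recode_to_class(occupations, CLASS_SCHEME))
print(recode_to_score(occupations, STATUS_SCORE))
```

The class recoding costs one parameter per class beyond the reference category, whereas the score costs a single parameter but imposes linearity. That trade-off is part of why the choice depends on your question.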

Erikson, R., Goldthorpe, J. H., & Portocarero, L. (1979). Intergenerational class mobility in three Western European societies: England, France and Sweden. British Journal of Sociology, 30(4), 415-441.

Ganzeboom, H. B., De Graaf, P. M., & Treiman, D. J. (1992). A standard international socio-economic index of occupational status. Social Science Research, 21(1), 1-56.

Weeden, K. A., & Grusky, D. B. (2005). The Case for a New Class Map. American Journal of Sociology, 111(1), 141-212.
