Solved – How to interpret logistic regression output for categorical variables when two categories are missing

Tags: binary data, categorical data, interpretation, logistic, spss

I am running a binary logistic regression in which the dependent variable is coded 1 or 0. The independent variables fall into two groups. The first group contains continuous variables (LNTA: logarithm of total assets, ROA: return on assets, and Leverage). The second group contains categorical variables (Type of auditor: 1 or 0; Industry sector: 1, 2, 3, and 4; Country: 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10; and Region: 1 or 2). The problem is that when I enter all of these independent variables together, I get results for only 8 countries instead of 9. I know that with 10 countries only 9 will appear in the results table, because country 10 is the reference category; but this case is different, because both countries 9 and 10 are missing from the table.

Best Answer

This was going to be a comment asking for clarification, but I wanted to give a screenshot.

A quick question (which you might already know the answer to) -- do you have missing data in any of your variables? I'd suspect the most likely culprit is that every observation from "Country 9" has at least one missing value, so all of the Country 9 observations are excluded from the analysis.

The Logistic Regression output in SPSS should start off with a "Case Processing Summary" table that will answer this for you. Here's an example from a dataset with no missing values (I just blanked out the raw data filename).

[Screenshot: SPSS "Case Processing Summary" table]
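For readers without SPSS at hand, the same check can be sketched in R. This is only a rough illustration on made-up data: the data frame `d`, its columns, and the choice to blank out ROA for Country 9 are all assumptions, not the asker's actual data. It counts complete cases per country (the information the Case Processing Summary gives) and then shows that a country with no complete cases gets no coefficient.

```r
# Hypothetical data: 10 countries, 30 firms each; all names are illustrative.
set.seed(1)
d <- data.frame(
  country = factor(rep(1:10, each = 30)),
  roa     = rnorm(300),
  y       = rbinom(300, 1, 0.5)
)
d$country <- relevel(d$country, ref = "10")  # Country 10 as reference, as in the question
d$roa[d$country == "9"] <- NA                # assumption: every Country 9 row lacks ROA

# Rows per country before and after listwise deletion
n_all      <- table(d$country)
n_complete <- table(d$country[complete.cases(d)])
case_summary <- data.frame(
  country  = names(n_all),
  all      = as.vector(n_all),
  complete = as.vector(n_complete)
)
print(case_summary)

# Under R's default na.action setting (na.omit), glm() drops incomplete rows,
# and a factor level with no remaining rows gets no coefficient.
fit <- glm(y ~ roa + country, data = d, family = binomial)
country_terms <- grep("^country", names(coef(fit)), value = TRUE)
print(country_terms)
```

In this mock-up only eight country terms are estimated: Country 10 is absorbed as the reference and Country 9 disappears because none of its rows survive listwise deletion, which is the pattern described in the question.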

EDIT: Example two, parameter specification. Just in case!

Another issue with SPSS is that the "Parameter Coding" doesn't necessarily correspond to your original values.

e.g. in the "Variables in the Equation" table, Country(7) doesn't necessarily mean the Country with the numerical value of 7, but rather the seventh parameter associated with the Country factor. You should check the "Categorical Variables Codings" table to make sure that all nine countries are showing up in that list.

In the example figure below, I mocked up a dataset with five countries and two regions. All values of the outcome for country 3 were set to missing (but all values of Region were complete). Country 3 is skipped from the parameter coding -- but you'll see that Country(3) [the third column of coding for the Country factor] actually pertains to Country==4 in the dataset.

[Figure: second set of tables showing parameter specification]
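The positional labelling can be mimicked in R to make the mapping explicit. This is a sketch under assumptions: the five-country vector below is invented to mirror the mock-up (Country 3 never reaches the model), and indicator coding with the last category as reference is assumed to match the setup being discussed. The `Country(k)` column names are built by hand here purely for illustration.

```r
# Mock-up: five countries in the raw data, but Country 3 is lost to missing data.
country <- factor(c(1, 2, 4, 5, 1, 2, 4, 5))

# Indicator (dummy) coding with the last category as the reference
# (assumed to correspond to the coding used in the example).
codes <- contr.treatment(levels(country), base = nlevels(country))
colnames(codes) <- paste0("Country(", seq_len(ncol(codes)), ")")
print(codes)

# Map each positional parameter back to the original data value:
# for each column, find the row (level) that is coded 1.
param_map <- data.frame(
  parameter      = colnames(codes),
  original_value = rownames(codes)[apply(codes, 2, which.max)]
)
print(param_map)
```

The third parameter maps to the original value 4, not 3, so a label like Country(3) should always be read against the coding table rather than against the raw data values.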

Related Solutions

Regression – How to Determine Number of Levels and Combine Categories in Logistic Regression

You can add as many categories as you like, as long as you do not run into problems such as perfect separation. Also, as you add more levels, you will typically lose statistical power. So adding levels is not free.
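One quick way to spot a level that would cause separation is to cross-tabulate the factor against the outcome before fitting. The sketch below uses invented data (a four-level `sector` factor in which level "D" has no events); it is a simple screening check, not a complete diagnostic for separation involving several predictors at once.

```r
# Hypothetical data: sectors A-C have mixed outcomes, sector D has only zeros.
d <- data.frame(
  sector = factor(rep(c("A", "B", "C", "D"), each = 10)),
  y      = c(rep(c(0, 1), 15), rep(0, 10))
)

# Cross-tabulate each level against the outcome
tab <- table(d$sector, d$y)
print(tab)

# A level with an empty cell has outcomes that are all 0 or all 1,
# so its coefficient cannot be estimated as a finite value.
separated <- rownames(tab)[apply(tab == 0, 1, any)]
print(separated)
```

Levels flagged this way are candidates for merging with a substantively similar category, which ties in with the binning question below.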

As to binning, that depends on the substance. Take occupation code: There are many class schemes like the EGP classes (Erikson, Goldthorpe, Portocarero 1979) or micro classes (Weeden and Grusky 2005). You could also transform occupational codes to a measure of occupational status like the ISEI (Ganzeboom, De Graaf and Treiman 1992), and add that linearly. There are long debates on which one is best, but in essence they just represent different theories and measure slightly different things. So, whichever is best depends on what your question is.

Erikson, R., Goldthorpe, J. H., & Portocarero, L. (1979). Intergenerational class mobility in three Western European societies: England, France and Sweden. British Journal of Sociology, 30, 341-415.

Ganzeboom, H. B., De Graaf, P. M., & Treiman, D. J. (1992). A standard international socio-economic index of occupational status. Social science research, 21(1), 1-56.

Weeden, K. A., & Grusky, D. B. (2005). The Case for a New Class Map. American Journal of Sociology, 111(1), 141-212.

Solved – Interpretation of logistic regression intercept with one dummy coded categorical variable

I think you are making this hard on yourself. Make sure race is a factor variable so that the software provides the overall $\chi^2$ of association with $k-1$ d.f. for $k$ categories. Coding doesn't affect the value of $\chi^2$. Don't use a stepwise process for making inference about the importance of race. Use the overall "chunk" test as described above, which has a built-in perfect multiplicity adjustment besides being invariant to coding. In R this would look like (for a binary or ordinal logistic model predicting $Y$):

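library(rms)   # provides lrm(), rcs(), and the anova() method used below
f <- lrm(Y ~ rcs(age, 4) + race)   # race must be a factor
anova(f)   # 3 d.f. test for age, k-1 d.f. for race
# also prints the 2 d.f. test of linearity in age
# age is fit with a restricted cubic spline with 4 knots at default locations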

When doing multiple imputation with the Hmisc package aregImpute function or with the mice package, you would substitute the following for the 2nd line above:

f <- fit.mult.impute(Y ~ rcs(age, 4) + race, lrm, impute_object, n.impute=20)

which would adjust the covariance matrix for multiple imputation [n.impute recommended to be the percent of observations that have any variable missing].
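The invariance of the chunk test to coding can be checked with base R alone. The sketch below is a stand-in rather than the rms workflow above: it uses `glm()` and a likelihood ratio test instead of the Wald statistics that `anova()` reports for an `lrm` fit, age enters linearly rather than as a spline, and the simulated data and effect sizes are invented for illustration.

```r
# Hypothetical simulated data; variable names and effect sizes are illustrative.
set.seed(42)
n <- 400
d <- data.frame(
  age  = rnorm(n, 50, 10),
  race = factor(sample(c("a", "b", "c", "d"), n, replace = TRUE))
)
d$Y <- rbinom(n, 1, plogis(-0.5 + 0.02 * (d$age - 50) + 0.4 * (d$race == "b")))

# Likelihood ratio "chunk" test: compare models with and without race
fit0 <- glm(Y ~ age, data = d, family = binomial)
fit1 <- glm(Y ~ age + race, data = d, family = binomial)
lr1  <- anova(fit0, fit1, test = "Chisq")
print(lr1)

# Same test after changing the reference category of race
d2    <- transform(d, race = relevel(race, ref = "c"))
fit1b <- glm(Y ~ age + race, data = d2, family = binomial)
lr2   <- anova(fit0, fit1b, test = "Chisq")

chisq <- c(lr1$Deviance[2], lr2$Deviance[2])  # test statistic under each coding
df    <- c(lr1$Df[2], lr2$Df[2])              # k - 1 = 3 d.f. for 4 categories
print(chisq)
print(df)
```

The individual race coefficients change when the reference category changes, but the overall test statistic and its $k-1$ degrees of freedom do not, which is why the chunk test is the safer basis for judging the importance of race.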
