How to use multiple regression when I have several levels for categorical variables? How do I code for this?

categorical data, multiple regression

My data are from a cross-sectional study looking at pathogen status among 300 patients, along with other clinical parameters. There is no control group. I am using Stata. (Edit: this is a cross-sectional study; the data were gathered post-discharge. Aside from descriptive statistics, I want to look for differences among patients who have staph as opposed to candida, etc. Maybe those with candida have higher WBC or creatinine. This seems feasible when it comes to variables with one level. But what if my question pertains to antibiotics (ABx): was there a difference in the number of ABx (10 levels) among those who have candida vs. staph?)

I have several clinical variables, including white blood count (continuous), gender (dichotomous), and antibiotic use (categorical with many levels).

The dependent variable is pathogen status: the variable pathogen has 7 levels, including "no pathogen".

Can I use multiple regression? I think I need dummy variables, but most of the patients fall into multiple groups within a categorical variable, such as more than one antibiotic class or more than one pathogen. How do I code for this? If someone took three classes of ABx, would that person receive a 1 for all three?

Best Answer

Recommended general approach

Even if you ultimately want to build a model that predicts pathogen status as a "dependent variable" based on its relation to other variables, consider using it as an "independent variable" at this evidently exploratory stage of your cross-sectional, post-discharge data analysis. That is, examine continuous variables as dependent variables, with pathogen status treated as one of the independent variables. This is the most straightforward way to find whether "those with candida have higher WBC," for example. For relations among categorical variables, analyze contingency tables with chi-square or similar tests. These results may be easier to think about and explain to colleagues at this stage than the multiple logistic models suggested by one commenter, even if your study in the long run develops such models. My suggestions for category coding apply however you proceed.
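As a rough illustration of those two exploratory steps, here is a minimal Python sketch (standard library only; the same summaries are available in Stata). The records, the field layout, and the function names `mean_by_group` and `chi_square_statistic` are all hypothetical, and each toy patient is assumed to have a single pathogen to keep the example short. It reports mean WBC per pathogen group and the Pearson chi-square statistic for a pathogen-by-sex table; it does not compute a p-value, which would require the chi-square distribution.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy records: (pathogen, WBC, sex). Not real study data.
patients = [
    ("candida", 14.2, "F"),
    ("candida", 12.8, "M"),
    ("staph", 9.1, "F"),
    ("staph", 10.4, "M"),
    ("staph", 8.7, "M"),
    ("none", 7.5, "F"),
]


def mean_by_group(rows):
    """Mean WBC for each pathogen group (continuous outcome by category)."""
    groups = {}
    for pathogen, wbc, _sex in rows:
        groups.setdefault(pathogen, []).append(wbc)
    return {group: mean(values) for group, values in groups.items()}


def chi_square_statistic(rows):
    """Pearson chi-square statistic for the pathogen x sex contingency table."""
    cell = Counter((pathogen, sex) for pathogen, _wbc, sex in rows)
    row_total = Counter(pathogen for pathogen, _wbc, _sex in rows)
    col_total = Counter(sex for _pathogen, _wbc, sex in rows)
    n = len(rows)
    statistic = 0.0
    for pathogen in row_total:
        for sex in col_total:
            expected = row_total[pathogen] * col_total[sex] / n
            # Counter returns 0 for combinations that never occur.
            statistic += (cell[(pathogen, sex)] - expected) ** 2 / expected
    return statistic


wbc_means = mean_by_group(patients)
chi2 = chi_square_statistic(patients)
print(wbc_means)
print(chi2)
```

With counts as small as in this toy table the chi-square approximation is unreliable; the sketch only shows how the table and statistic are formed.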

Coding categorical variables

The way to deal with the categorical variables, when an individual may fit into more than one category, depends to a great extent on your understanding of the subject matter and the types of results you might thus expect to get from this analysis. For example, do you expect WBC from someone with a combination of pathogens to represent a sum of relations from each of the pathogens, or something substantially different? Are influences of combinations of antibiotics likely to be the sums of their individual influences, or something substantially different?

In the first case, where effects are close to additive, simply coding and analyzing the individual categories will do. If an individual can have more than one, there will effectively have to be a separate 0/1 indicator for each of the categories of antibiotic (or of pathogen). So the answer to "If someone took three classes of ABx would that person receive a 1 for all three?" is yes.
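The indicator coding described above can be sketched as follows. This is a minimal Python illustration, not Stata code; the class names in `ABX_CLASSES` and the helper `multi_hot` are hypothetical (the real study has 10 classes). Because a patient can belong to several categories at once, every class gets its own indicator and none is dropped as a reference level; a patient with all zeros simply received no antibiotics.

```python
# Hypothetical antibiotic classes, for illustration only.
ABX_CLASSES = ["penicillin", "cephalosporin", "macrolide", "quinolone"]


def multi_hot(taken, classes=ABX_CLASSES):
    """Return one 0/1 indicator per class; several can be 1 at once."""
    unknown = set(taken) - set(classes)
    if unknown:
        raise ValueError(f"unknown classes: {sorted(unknown)}")
    return {f"abx_{c}": int(c in taken) for c in classes}


# A patient who took three classes receives a 1 for all three.
patient = multi_hot({"penicillin", "macrolide", "quinolone"})
print(patient)
```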

You might be able to use subject-matter knowledge to group together types of pathogens or types of antibiotics expected to have similar properties (or those often found/prescribed together in practice) and thus decrease the number of categories. To begin exploration you might even want to start with the roughest breakdowns of all: any pathogen Yes/No; any antibiotic Yes/No. Start by lumping before you split too far.
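One way to picture this lumping is the Python sketch below. The grouping in `PATHOGEN_GROUP` and the helper `lump` are assumptions made only for illustration; the appropriate grouping for the actual study is a subject-matter decision. The function collapses a patient's set of pathogens into a coarse "any pathogen" flag plus one 0/1 indicator per group.

```python
# Hypothetical grouping, chosen only to illustrate the idea.
PATHOGEN_GROUP = {
    "staph": "gram_positive",
    "strep": "gram_positive",
    "e_coli": "gram_negative",
    "pseudomonas": "gram_negative",
    "candida": "fungal",
}


def lump(pathogens):
    """Collapse a patient's set of pathogens into coarse 0/1 indicators."""
    groups = {PATHOGEN_GROUP[p] for p in pathogens}
    coded = {"any_pathogen": int(len(pathogens) > 0)}
    for group in sorted(set(PATHOGEN_GROUP.values())):
        coded[group] = int(group in groups)
    return coded


mixed = lump({"staph", "candida"})  # two pathogens, two groups
clean = lump(set())                 # "no pathogen"
print(mixed)
print(clean)
```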

If instead you expect substantially non-additive effects of categorical variables, you would have to code the data and run the analyses in a way that takes the combinations into account. Ideally this would be done with interaction terms among the categories, but this becomes extremely difficult with 10 choices of antibiotics and 7 choices of pathogens. With only 300 cases you may not have enough data to detect such combination effects even if they are present. If you can somehow group the pathogens and the antibiotics further into fewer, larger groups, as suggested above, that might help. You will have no information about combinations that do not appear in your data set.
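To see why combinations become sparse, it can help to tabulate which ones actually occur. The Python sketch below uses made-up records and a hypothetical helper, `combination_counts`; it assumes 10 antibiotic classes, any subset of which a patient might have received, so there are 2^10 = 1024 possible combinations against only 300 patients.

```python
from collections import Counter

# Made-up antibiotic sets for six patients; not real study data.
records = [
    {"penicillin"},
    {"penicillin", "macrolide"},
    {"penicillin"},
    set(),
    {"macrolide"},
    {"penicillin", "macrolide"},
]


def combination_counts(rows):
    """Count how many patients share each exact combination of classes."""
    return Counter(frozenset(r) for r in rows)


counts = combination_counts(records)
n_classes = 10                # assumption: 10 classes, as in the question
possible = 2 ** n_classes     # every subset of the classes
observed = len(counts)        # combinations that actually appear
print(f"{observed} of {possible} possible combinations observed")
```

Combinations with a count of zero or only a handful of patients cannot support an interaction estimate, which is the practical argument for grouping before modeling.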