I have a dataset that contains only categorical data: each predictor takes values such as A, B, C, D (like factors). There are 10 predictors, and the dependent variable is binary (0 or 1).
UPDATE: My predictors are answers to multiple-choice questions in a questionnaire, so each predictor takes only categorical values, e.g. X_1 can be A, B, C or D, and X_2 can be A, B, C, D, E, F, G or H.
Is it feasible to fit a logistic regression over this dataset?
Ideally, if I can fit a logistic regression to the data, I will then use it for prediction on a set of test data, which again contains only categorical data.
What are the pitfalls that I should look out for?
Best Answer
Yes, of course you can. Just be aware of the nature of your categorical data: is it ordered or unordered?
If ordered (e.g. small, medium, large) you might want a single feature X1 with values like (1, 1, 3, 2, 3, 1, ...) where 1 represents small, 2 represents medium, etc.
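As a rough sketch of the ordered case, assuming Python and a hypothetical size variable with levels small < medium < large, the levels can be mapped to integer ranks. The names `SIZE_ORDER` and `encode_ordinal` are made up for illustration and do not come from any library.

```python
# Hypothetical ordered levels; their position in this list defines the integer code.
SIZE_ORDER = ["small", "medium", "large"]

def encode_ordinal(values, ordered_levels):
    """Map each level to its 1-based rank in ordered_levels."""
    rank = {level: i + 1 for i, level in enumerate(ordered_levels)}
    return [rank[v] for v in values]

# Illustrative data only.
sizes = ["small", "small", "large", "medium", "large", "small"]
x1 = encode_ordinal(sizes, SIZE_ORDER)
print(x1)  # [1, 1, 3, 2, 3, 1]
```

Note that a single integer-coded feature makes the model treat each step between adjacent levels as having the same effect on the log-odds. If that seems implausible for your questions, indicator features are the safer choice.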
If unordered (e.g. red, blue, green) you'll want multiple features like X1 = (0, 0, 1, 0) representing "is red?", X2 = (1, 0, 0, 1) representing "is blue?" and so forth.
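For the unordered case, here is a minimal sketch in plain Python of building one 0/1 indicator feature per level. The function `one_hot`, its `drop_first` parameter, and the colour data are assumptions made for illustration, not part of any particular library.

```python
def one_hot(values, levels, drop_first=False):
    """Return a dict mapping each kept level to its list of 0/1 indicators."""
    # Optionally drop the first level so it acts as the reference category.
    kept = levels[1:] if drop_first else levels
    return {level: [1 if v == level else 0 for v in values] for level in kept}

# Illustrative data only.
colors = ["blue", "green", "red", "blue"]
levels = ["red", "blue", "green"]

cols = one_hot(colors, levels)
print(cols["red"])   # [0, 0, 1, 0]  -> "is red?"
print(cols["blue"])  # [1, 0, 0, 1]  -> "is blue?"

# With an intercept in the model, keep one indicator fewer than the number of levels.
reduced = one_hot(colors, levels, drop_first=True)
print(sorted(reduced))  # ['blue', 'green']
```

When the model includes an intercept, the full set of indicators for one variable always sums to 1 and is therefore perfectly collinear with it, so one level is usually dropped and serves as the reference category. Most statistical packages do this automatically when a predictor is declared as a factor.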