Solved – building a classification model for strictly binary data

bayesian networkclassificationmachine learningrandom forestsvm

i have a data set that is strictly binary. each variable's set of values is in the domain: true, false.

the "special" property of this data set is that an overwhelming majority of the values are "false".

i have already used a bayesian network learning algorithm to learn a network from the data. however, for one of my target nodes (the most important one, being death), the AUC result is not very good; it is slightly better than chance. even the positive-predictive value (PPV), which has been suggested to me on CV, was not competitive with what's reported in the literature with other approaches. note that AUC (ROC analysis) is the typical benchmark reported in this area of clinical research, but i am also opened to suggestions on how to more appropriately benchmark the classification model if there are any other ideas.

so, i was wondering what other classification models i can try for this type of data set with this property (mostly false values).

  • would support vector machine help? as far as i know, SVM only deals with continuous – variables as the predictors (although it has been adapted to multi-class). but my variables are all binary.
  • would a random forest help?
  • would logistic regression apply here? as far as i know, the predictors in logistic regression also are continuous. is there a generalized version for binary variables as the predictors?

aside from classification performance, i suspect SVM and random forest might very well outperform the bayesian network, but the problem shifts to how to explain the relationships in these models (especially to clinicians).

Best Answer

would support vector machine help? as far as i know, SVM only deals with continuous - variables as the predictors ...

Binary variables are not a problem for SVM. Even specialized kernels exist for exactly such data (Hamming kernel, Tanimoto/Jaccard kernel), though I don't recommend using those if you're not intimately familiar with kernel methods.

would logistic regression apply here? as far as i know, the predictors in logistic regression also are continuous

Logistic regression works with binary predictors. It is probably your best option.

how to explain the relationships in these models (especially to clinicians).

If you use linear SVM it is fairly straightforward to explain what's going on. Logistic regression is a better option, though, since most clinicials actually know these models (and by know I mean have heard of).

Related Question