Solved – Artificial dataset generator for classification data

classification, distributions, machine learning, r

I would like to generate some artificial data to evaluate an algorithm for classification (the algorithm induces a model that predicts posterior probabilities).

These are some basic properties of the dataset:

  • Features have to be continuous
  • Response variable is dichotomous (either 0 or 1)

I would like to test whether the algorithm can cope with:

  • Many features / high-dimensional problems
  • Noise (it can drop features)
  • Multi-modality
  • ??? (how do I simulate correlation etc.)

I intend to implement the algorithm in R or Matlab. I can sample from multivariate normal distributions and specify a covariance matrix.
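A minimal sketch of that sampling step in R, assuming `mvrnorm()` from the MASS package is used. The number of features, the correlation `rho`, the class means, and the sample sizes are arbitrary illustrative choices, not values from any reference dataset. Correlation comes from the off-diagonal entries of the covariance matrix, and multi-modality from drawing one class as a mixture of two Gaussians.

```r
library(MASS)  # provides mvrnorm(); assumed to be installed

set.seed(1)
p   <- 5      # number of features (arbitrary choice)
rho <- 0.6    # pairwise correlation between features (arbitrary choice)
n   <- 200    # observations per class (arbitrary choice)

# Equicorrelated covariance matrix: 1 on the diagonal, rho elsewhere
Sigma <- matrix(rho, p, p)
diag(Sigma) <- 1

# Class 0: a single Gaussian centred at the origin
x0 <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

# Class 1: a mixture of two Gaussians, which makes this class bimodal
x1 <- rbind(mvrnorm(n / 2, mu = rep( 3, p), Sigma = Sigma),
            mvrnorm(n / 2, mu = rep(-3, p), Sigma = Sigma))

X <- rbind(x0, x1)             # feature matrix, one row per observation
y <- rep(c(0, 1), each = n)    # dichotomous response
```

Moving the class-1 means closer to the origin makes the classes overlap more, which gives a harder problem.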

I would appreciate any feedback.

Best Answer

One option is to generate something like the Madelon dataset from the NIPS 2003 feature selection challenge; it fits your requirements pretty well.

You can generate a set like this starting with mlbench.xor (or mlbench.hypercube, which might be easier) from the mlbench package. Then combine the classes it generates into two groups to get the dichotomous response, and add new attributes to increase the dimensionality: some as random linear combinations of the original ones, others as pure random noise.
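The recipe above could be sketched roughly as follows. This is an illustration under stated assumptions, not a reproduction of the actual Madelon generator: the hypercube dimension, the numbers of redundant and noise features, and the random split of clusters into two groups are all arbitrary choices.

```r
library(mlbench)  # assumes the mlbench package is installed

set.seed(42)
d       <- 3    # hypercube dimension, giving 2^d clusters (arbitrary choice)
n_lin   <- 10   # random linear-combination features (arbitrary choice)
n_noise <- 20   # pure-noise features (arbitrary choice)

# Gaussian clusters located at the corners of a d-dimensional hypercube
hc <- mlbench.hypercube(n = 800, d = d, sd = 0.1)
n  <- nrow(hc$x)  # use the actual row count; mlbench rounds n per cluster

# Randomly assign half of the clusters to group 0 and half to group 1
groups <- sample(rep(c(0, 1), length.out = nlevels(hc$classes)))
y <- groups[as.integer(hc$classes)]

# Redundant features: random linear combinations of the original ones
W     <- matrix(rnorm(d * n_lin), nrow = d, ncol = n_lin)
x_lin <- hc$x %*% W

# Irrelevant features: independent Gaussian noise
x_noise <- matrix(rnorm(n * n_noise), nrow = n, ncol = n_noise)

# Combine everything and shuffle the columns to hide the informative ones
X <- cbind(hc$x, x_lin, x_noise)
X <- X[, sample(ncol(X))]
```

Because several clusters end up in each group, both classes are multi-modal, and raising `n_noise` makes the feature-selection part of the problem harder.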