Solved – Two worlds collide: Using ML for complex survey data

machine-learning, survey-sampling, survey-weights

I am stuck on a seemingly easy problem, but I haven't found a suitable solution for several weeks now.

I have quite a lot of poll/survey data (tens of thousands of respondents, say 50k per dataset), coming from what I believe is called a complex survey design, with weights, stratification, specific routing, and so on. For each respondent, there are hundreds of variables, such as demographics (age, region…) and then mostly binary (at most, categorical) variables.

I come from more of a computer science/machine learning background, and I have had to learn a lot about classical survey statistics and methodology. Now I want to apply classical machine learning to these data (e.g. predicting some missing values for a subset of respondents – basically a classification task). But, lo and behold, I cannot find a suitable way to do that. How should I incorporate the strata, weights, or routing (like: if question 1 is answered with option 2, ask question 3, otherwise skip it)?

Simply applying my models (trees, logistic regression, SVM, XGBoost…) seems dangerous (and they fail in most cases), since they usually assume the data come from a simple random sample or are i.i.d.

A lot of methods at least accept weights, but that doesn't help much. Furthermore, it is unclear how I should combine imbalanced classes and the weights given by the survey design, not to mention the stratification. The resulting models should also be well calibrated – the predicted distribution should be very close to the original one. Good predictive performance isn't the only criterion here. I changed the optimisation metric to take this into account as well (such as the distance of the predicted distribution from the true distribution + accuracy/MCC), and it helped in some cases while crippling performance in others.

Is there some canonical way to deal with this problem? It seems to me a heavily underappreciated area of research. IMO many surveys could benefit from ML's power, but there are no sources. It's as if these are two worlds not interacting with each other.

What I have found so far:

Related CV questions, but none of them contains a usable answer on how to approach this (either no answer, not what I'm asking for, or misleading recommendations):

Best Answer

Update May 2022: In terms of accounting for survey weights, there's a nice pair of recent (2020?) articles on arXiv by Dagdoug, Goga, and Haziza. They list many ML-flavored methods and discuss how they have been / could be modified to incorporate weights, including kNN, splines, trees, random forests, XGBoost, BART, Cubist, SVM, principal component regression, and elastic net.

In terms of accounting for strata and clusters, and estimating predictive performance for the population when the data came from a complex sampling design, I humbly submit a recent article on "K-fold cross-validation for complex sample surveys," Wieczorek, Guerin, and McMahon (2022).

  • See this answer for a quick overview of how to create cross-validation folds that respect your first-stage clusters and strata. If you're an R user, you can use folds.svy() in our R package surveyCV.
  • Then use these folds to cross-validate your ML models as usual.
  • Finally, if you also have survey weights, calculate survey-weighted means of your CV test errors.

Our README has an example of doing this for parameter tuning with a random forest.
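Roughly, the workflow looks like the sketch below. This is a minimal illustration, not the README example itself: the column names (y, x1, x2, strat, psu, wt) and the plain glm() stand-in for "your ML model" are placeholders, and the folds.svy() argument names are taken from my reading of the surveyCV documentation, so check ?folds.svy against your version.

```r
# Sketch: design-respecting CV folds, then a survey-weighted mean of test errors.
# Assumes a data frame `df` with outcome y, predictors x1/x2, strata strat,
# first-stage clusters psu, and survey weights wt (all placeholder names).
library(survey)
library(surveyCV)

set.seed(1)
df$fold <- folds.svy(df, nfolds = 5, strataID = "strat", clusterID = "psu")

df$err <- NA
for (k in 1:5) {
  train <- df[df$fold != k, ]
  test  <- df[df$fold == k, ]
  fit <- glm(y ~ x1 + x2, family = binomial, data = train)  # swap in any ML model here
  p <- predict(fit, newdata = test, type = "response")
  df$err[df$fold == k] <- as.numeric((p > 0.5) != test$y)   # 0/1 loss per test row
}

# Survey-weighted mean of the CV test errors
des <- svydesign(ids = ~psu, strata = ~strat, weights = ~wt, data = df, nest = TRUE)
svymean(~err, des)
```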


Update August 2017: There isn't very much work yet on "modern" ML methods with complex survey data, but the most recent issue of Statistical Science has a couple of review articles. See especially Breidt and Opsomer (2017), "Model-Assisted Survey Estimation with Modern Prediction Techniques".

Also, based on the Toth and Eltinge paper you mentioned, there is now an R package rpms implementing CART for complex-survey data.
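For a sense of what that looks like, here is a rough sketch of fitting one survey-weighted tree with rpms; the argument names are based on the package documentation (check ?rpms), and df and its columns are the same placeholders as above.

```r
# Sketch: a design-weighted regression tree with the rpms package.
library(rpms)

tree <- rpms(rp_equ   = y ~ x1 + x2,  # recursive-partitioning equation
             data     = df,
             weights  = ~wt,          # survey weights
             strata   = ~strat,       # design strata
             clusters = ~psu)         # first-stage clusters
tree
```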


Original answer October 2016:

Now I want to apply classical machine learning to those data (e.g. predicting some missing values for subset of respondents - basically classification task).

I'm not fully clear on your goal. Are you primarily trying to impute missing observations, just to have a "complete" dataset to give someone else? Or do you have complete data already, and you want to build a model to predict/classify new observations' responses? Do you have a particular question to answer with your model(s), or are you data-mining more broadly?

In either case, complex-sample-survey / survey-weighted logistic regression is a reasonable, pretty well-understood method. There's also ordinal regression for more than 2 categories. These will account for strata and survey weights. Do you need a fancier ML method than this?

For example, you could use svyglm in R's survey package. Even if you don't use R, the package author, Thomas Lumley, also wrote a useful book, "Complex Surveys: A Guide to Analysis Using R", which covers both logistic regression and missing data for surveys.
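As a quick sketch (again with placeholder column names psu, strat, wt, y, age, region), the usual pattern is to declare the design once and then fit against it:

```r
# Sketch: design-based logistic regression with the survey package.
library(survey)

des <- svydesign(ids = ~psu, strata = ~strat, weights = ~wt,
                 data = df, nest = TRUE)

# quasibinomial() avoids spurious warnings about non-integer successes
# that binomial() can produce with non-integer survey weights.
fit <- svyglm(y ~ age + region, design = des, family = quasibinomial())
summary(fit)
```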

(For imputation, I hope you're already familiar with general issues around missing data. If not, look into approaches like multiple imputation to help you account for how the imputation step affects your estimates/predictions.)

Question routing is indeed an additional problem. I'm not sure how best to deal with it. For imputation, perhaps you can impute one "step" in the routing at a time. E.g. using a global model, first impute everyone's answer to "How many kids do you have?"; then run a new model on the relevant sub-population (people with more than 0 kids) to impute the next step of "How old are your kids?"
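In code, that two-step idea might look something like the sketch below; the variables (n_kids, kids_age) and the simple glm/lm models are purely illustrative stand-ins for whatever model you prefer, and for honest uncertainty you'd still want to wrap this in multiple imputation.

```r
# Sketch: impute one routing "step" at a time (hypothetical variables).

# Step 1: impute "How many kids do you have?" for everyone with it missing,
# using a model fit on the respondents who answered it.
m1 <- glm(n_kids ~ age + region, family = poisson,
          data = df[!is.na(df$n_kids), ])
miss1 <- is.na(df$n_kids)
df$n_kids[miss1] <- round(predict(m1, newdata = df[miss1, ], type = "response"))

# Step 2: only respondents routed to the follow-up (n_kids > 0) get
# "How old are your kids?" imputed, from a model fit on that sub-population.
sub <- df$n_kids > 0
m2 <- lm(kids_age ~ age + region + n_kids,
         data = df[sub & !is.na(df$kids_age), ])
miss2 <- sub & is.na(df$kids_age)
df$kids_age[miss2] <- predict(m2, newdata = df[miss2, ])
```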
