I am assuming that one of the statements is your outcome of interest; something like, "I am satisfied with my customer experience". You want to tie both the responses to other questions and the demographic/transactional/profile information about your customers back to this outcome statement. If so, your question sounds like what's often called "key driver analysis". In my experience it's never one driver but several, and those drivers change across customer profiles.
Do you have a hypothesized framework for what drives satisfaction? Do you believe that there is some unmeasured, latent influence on satisfaction that is expressed by the things you can measure? If so, you might use structural equation modeling or a confirmatory factor analysis to confirm or refute your hypotheses.
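As a concrete sketch of the CFA route, here is roughly what that looks like with the lavaan package in R. The items `q1`-`q4`, the data frame `survey_df`, and the single "satisfaction" factor are hypothetical placeholders for your own survey items and hypothesized structure:

```r
# Minimal CFA sketch with lavaan: one hypothesized latent factor,
# "satisfaction", measured by four observed survey items (hypothetical names)
library(lavaan)

model <- '
  satisfaction =~ q1 + q2 + q3 + q4
'
fit <- cfa(model, data = survey_df)

# Fit measures (CFI, RMSEA, etc.) help confirm or refute the hypothesis
summary(fit, fit.measures = TRUE, standardized = TRUE)
```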
Otherwise, you might look at techniques such as partial least squares or principal components regression. These tend not to come from a preconceived hypothesis of how the world works. You may even learn a great deal simply by visualizing the correlations between the different survey item responses and your satisfaction outcome measurement; no formal model needed.
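For instance, a quick no-model look at the correlations, followed by a PLS regression via the pls package, might look like the sketch below. Again, `survey_df` and its columns are hypothetical, and I'm assuming all columns are numeric item scores:

```r
library(pls)

# Exploratory, no-model step: correlation of each item with the outcome
# (assumes survey_df contains only numeric columns)
round(cor(survey_df)[, "satisfaction"], 2)

# Partial least squares regression; "CV" cross-validates the number of
# components rather than assuming a structure up front
fit <- plsr(satisfaction ~ ., data = survey_df, validation = "CV")
summary(fit)  # reports RMSEP by number of components
```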
Please note that unless your key drivers are uncorrelated (which is highly unlikely), you will have to deal with untangling multicollinearity in your analysis. That is, if customer satisfaction is correlated with both price and packaging, but price and packaging are correlated with each other, you'll need an approach that either exploits or accounts for the correlation between price and packaging. The four approaches I listed above, which are certainly not exhaustive, deal with multicollinearity in different ways.
For a gentle introduction to customer satisfaction analysis and psychographics, I like this author's work. Please note I don't know him and am in no way affiliated; I just like his style, and he does examples in R (which I also use). He has postings on network analysis of key drivers, structural equation modeling, and relative importance of drivers, among others.
Update May 2022: In terms of accounting for survey weights, there's a nice pair of recent (2020?) articles on arXiv by Dagdoug, Goga, and Haziza. They list many ML-flavored methods and discuss how they have been, or could be, modified to incorporate weights, including kNN, splines, trees, random forests, XGBoost, BART, Cubist, SVMs, principal component regression, and elastic net.
In terms of accounting for strata and clusters, and estimating predictive performance for the population when the data came from a complex sampling design, I humbly submit a recent article on "K-fold cross-validation for complex sample surveys," Wieczorek, Guerin, and McMahon (2022).
- See this answer for a quick overview of how to create cross-validation folds that respect your first-stage clusters and strata. If you're an R user, you can use `folds.svy()` in our R package `surveyCV`.
- Then use these folds to cross-validate your ML models as usual.
- Finally, if you also have survey weights, calculate survey-weighted means of your CV test errors.
Our README has an example of doing this for parameter tuning with a random forest.
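Putting those steps together, a rough sketch of the workflow (not copied from the README; the data frame `dat` and its columns `y`, `x1`, `x2`, `stratum`, `cluster`, and `wt` are hypothetical, and the `folds.svy()` argument names follow my reading of the package docs) might look like:

```r
library(surveyCV)
library(randomForest)

# Step 1: design-based folds that respect strata and first-stage clusters
nfolds <- 5
dat$fold <- folds.svy(dat, nfolds = nfolds,
                      strataID = "stratum", clusterID = "cluster")

# Step 2: cross-validate the ML model as usual, but using those folds
test_err <- numeric(nfolds)
for (k in 1:nfolds) {
  train <- dat[dat$fold != k, ]
  test  <- dat[dat$fold == k, ]
  rf    <- randomForest(y ~ x1 + x2, data = train)
  pred  <- predict(rf, newdata = test)
  # Step 3: survey-weighted mean of the squared test errors on this fold
  test_err[k] <- weighted.mean((test$y - pred)^2, w = test$wt)
}
mean(test_err)  # overall CV estimate of generalization error
```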
Update August 2017: There isn't very much work yet on "modern" ML methods with complex survey data, but the most recent issue of Statistical Science has a couple of review articles.
See especially Breidt and Opsomer (2017), "Model-Assisted Survey Estimation with Modern Prediction Techniques".
Also, based on the Toth and Eltinge paper you mentioned, there is now an R package rpms implementing CART for complex-survey data.
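A hedged example of fitting such a tree, based on my reading of the rpms documentation (where the design variables are passed as one-sided formulas; the data frame `df` and its columns are hypothetical):

```r
# Survey-weighted regression tree with rpms; check ?rpms for the exact
# argument names in your installed version
library(rpms)

tree <- rpms(rp_equ   = y ~ x1 + x2,
             data     = df,
             weights  = ~wt,       # survey weights
             strata   = ~stratum,  # design strata
             clusters = ~cluster)  # first-stage clusters
tree
```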
Original answer October 2016:
> Now I want to apply classical machine learning to those data (e.g. predicting some missing values for a subset of respondents - basically a classification task).
I'm not fully clear on your goal.
Are you primarily trying to impute missing observations, just to have a "complete" dataset to give someone else? Or do you have complete data already, and you want to build a model to predict/classify new observations' responses? Do you have a particular question to answer with your model(s), or are you data-mining more broadly?
In either case, complex-sample-survey / survey-weighted logistic regression is a reasonable, pretty well-understood method. There's also ordinal regression for more than 2 categories. These will account for strata and survey weights. Do you need a fancier ML method than this?
For example, you could use `svyglm` in R's `survey` package. Even if you don't use R, the package author, Thomas Lumley, also wrote a useful book, "Complex Surveys: A Guide to Analysis Using R," which covers both logistic regression and missing data for surveys.
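A minimal sketch of that approach (the data frame `df` and the design variables `cluster`, `stratum`, and `wt` are hypothetical placeholders):

```r
library(survey)

# Declare the complex design: first-stage clusters, strata, and weights
des <- svydesign(ids = ~cluster, strata = ~stratum,
                 weights = ~wt, data = df, nest = TRUE)

# Survey-weighted logistic regression for a binary 0/1 outcome y;
# quasibinomial avoids spurious warnings about non-integer successes
fit <- svyglm(y ~ x1 + x2, design = des, family = quasibinomial())
summary(fit)
```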
(For imputation, I hope you're already familiar with general issues around missing data. If not, look into approaches like multiple imputation to help you account for how the imputation step affects your estimates/predictions.)
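If you go the multiple-imputation route, the mice package is one standard tool; a bare-bones sketch (with a hypothetical data frame `df`) looks like this. Note that plain mice does not account for the survey design by itself; one common suggestion is to include the weights and design variables as predictors in the imputation model.

```r
library(mice)

imp  <- mice(df, m = 5, seed = 123)                     # 5 imputed datasets
fits <- with(imp, glm(y ~ x1 + x2, family = binomial()))  # fit on each
pool(fits)                                              # combine by Rubin's rules
```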
Question routing is indeed an additional problem. I'm not sure how best to deal with it. For imputation, perhaps you can impute one "step" in the routing at a time. E.g. using a global model, first impute everyone's answer to "How many kids do you have?"; then run a new model on the relevant sub-population (people with more than 0 kids) to impute the next step of "How old are your kids?"
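To make the routing idea concrete, here is a naive single-imputation sketch just to show the two-step logic (variable names are invented; in practice you'd embed this in a proper multiple-imputation scheme rather than plugging in point predictions):

```r
# Step 1: global model to impute "How many kids do you have?"
miss1 <- is.na(df$n_kids)
fit1  <- glm(n_kids ~ age + income, data = df[!miss1, ], family = poisson())
df$n_kids[miss1] <- round(predict(fit1, newdata = df[miss1, ], type = "response"))

# Step 2: model on the routed sub-population (n_kids > 0) to impute
# "How old are your kids?"
sub   <- df$n_kids > 0
miss2 <- sub & is.na(df$kid_age)
fit2  <- lm(kid_age ~ age + income, data = df[sub & !is.na(df$kid_age), ])
df$kid_age[miss2] <- predict(fit2, newdata = df[miss2, ])
```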
Best Answer
I work for a health care company on our member satisfaction team where weights are constantly applied to match the sample to the populations of our service regions. This is very important for interpretable modeling that aims to explain magnitude of relationships between variables. We also use a lot of ML for other tasks, but it seems like you may be wondering if this is important when using machine learning for prediction.
As you hinted, most machine learning techniques were developed not to explain relationships but for predictive purposes. While a representative sample is important, it may not be critical... until your performance tanks.
If an algorithm has enough samples to learn each respondent type, it will predict new respondents' class (classification) or value (regression) well. For example, suppose you had a data set with four variables, height, weight, age, and sex, and we are trying to predict sex from the other three. Say most people in the population are female, 5'4", 35 years old, and 130 pounds (not fact, just roll with it).
Now say my sample under-represents this demographic proportionally, yet still contains a high enough number (N) of this type of person. Our model has learned what that type of person looks like, even though that type is not well represented in my sample. When the model sees a new person with those characteristics, it will predict the label most associated with them in the training data. If the sample shows that those characteristics are more associated with females than males, and this matches the population, then all is well. The problem arises when the sample's outcome distribution departs from the population's so much that the model predicts a different class or value.
So when it comes down to it, testing your predictive ML model on representative data is where you'll find out whether you have a problem. That said, I think it would be fairly rare to sample in such a biased way that prediction suffers greatly. If accuracy / kappa / AUC is low, or RMSE is high, when testing, you might shave off the respondents that over-represent demographics of interest, provided you have enough data.