Machine Learning – How to Work with Weighted and Complex Survey Data

machine learning, stratification, survey, weighted-sampling

I have worked a lot with various nationally representative data sources. These have a complex survey design, so the analysis requires specifying stratification and weight variables. Machine learning tools have not been applied to the data sources within my area of study. One obvious reason is that machine learning methods (currently) do not take weight and stratification variables into account.

The goal of weighted / stratified analyses is to obtain adjusted population estimates, which is different from the goal of machine learning. What thoughts do people have about using nationally representative data sources and ignoring the weight and stratification variables? In other words, what would your thoughts be if you were reviewing a machine learning study that used nationally representative data but ignored the weight and stratification variables, assuming that the researcher / author was up-front about this methodological decision and was not making claims of nationally representative results?

Thanks in advance!

Best Answer

I work for a health care company on our member satisfaction team, where weights are constantly applied to match the sample to the populations of our service regions. This is very important for interpretable modeling that aims to explain the magnitude of relationships between variables. We also use a lot of ML for other tasks, but it seems like you may be wondering whether weighting matters when using machine learning for prediction.
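As a minimal sketch of what applying such weights can look like, the snippet below computes simple post-stratification weights (population share divided by sample share) and compares the unweighted and weighted mean of a satisfaction score. The regions, population shares, and scores are all made up for illustration, and the helper names are hypothetical. Real survey weights usually also account for selection probability and non-response.

```python
from collections import Counter

# Hypothetical sample of (region, satisfaction score); region "B" is over-sampled.
sample = [("A", 4), ("A", 5),
          ("B", 2), ("B", 3), ("B", 2), ("B", 3), ("B", 2), ("B", 3)]

# Hypothetical population shares per region (assumed known, e.g. from membership counts).
population_share = {"A": 0.5, "B": 0.5}

def poststrat_weights(rows, pop_share):
    """Weight per row = population share of its group / sample share of its group."""
    counts = Counter(group for group, _ in rows)
    n = len(rows)
    return [pop_share[group] / (counts[group] / n) for group, _ in rows]

def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

scores = [score for _, score in sample]
weights = poststrat_weights(sample, population_share)

unweighted = sum(scores) / len(scores)      # dominated by the over-sampled region
weighted = weighted_mean(scores, weights)   # each region counts by its population share

print(f"unweighted mean: {unweighted:.2f}, weighted mean: {weighted:.2f}")
```

Here the over-sampled region pulls the unweighted mean down, while the weighted mean treats both regions according to their assumed population share.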

As you hinted, most machine learning techniques were developed for prediction rather than for explaining relationships. While a representative sample is important, it may not be critical... until your performance tanks.

If algorithms have sufficient samples to learn respondent types, they will be able to predict a new respondent's class (classification) / value (regression) well. For example, if you had a data set with 4 variables (height, weight, sex, and age), your algorithm of choice will learn certain types of person based on these characteristics. Say most people in the population are female, 5'4", 35 years old, and 130 pounds (not fact, just roll with it) and we are trying to predict sex from the other three variables. Now say my sample under-represents this demographic proportionally, yet still contains a high enough number (N) of this type of person. Our model has learned what that type of person looks like even though the type is under-represented in my sample. When the model sees a new person with those characteristics, it will have learned which label is most associated with that person. If our sample shows that those characteristics are more associated with females than males and this matches the population, then all is well. The problem arises when the outcome variable in the sample departs from the population by so much that the model predicts a different class / value.
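To make the distinction concrete, here is a toy sketch with entirely made-up counts. The "model" is just a lookup of the most common label per respondent type, which is an assumption for illustration and not a real learning algorithm. In the first sample a type is under-represented but its label ratio matches the population. In the second, the label ratio within that type is skewed.

```python
from collections import Counter, defaultdict

def make_rows(spec):
    """spec maps (respondent_type, label) -> count; expand it into a list of rows."""
    return [key for key, n in spec.items() for _ in range(n)]

def fit_majority(rows):
    """Toy 'model': for each respondent type, remember its most common label."""
    by_type = defaultdict(Counter)
    for rtype, label in rows:
        by_type[rtype][label] += 1
    return {rtype: counts.most_common(1)[0][0] for rtype, counts in by_type.items()}

def accuracy(model, rows):
    """Share of rows whose label matches the model's prediction for their type."""
    return sum(model[rtype] == label for rtype, label in rows) / len(rows)

# Hypothetical population: the "short" type is common and mostly female.
population = make_rows({("short", "F"): 700, ("short", "M"): 100,
                        ("tall", "F"): 50, ("tall", "M"): 150})

# Sample 1: "short" respondents are under-represented, but the F/M ratio
# within each type still matches the population.
sample_type_skew = make_rows({("short", "F"): 35, ("short", "M"): 5,
                              ("tall", "F"): 40, ("tall", "M"): 120})

# Sample 2: the outcome itself is skewed within the "short" type.
sample_outcome_skew = make_rows({("short", "F"): 10, ("short", "M"): 30,
                                 ("tall", "F"): 40, ("tall", "M"): 120})

model_1 = fit_majority(sample_type_skew)
model_2 = fit_majority(sample_outcome_skew)

acc_1 = accuracy(model_1, population)
acc_2 = accuracy(model_2, population)
print(f"type skew only: {acc_1:.2f}, outcome skew: {acc_2:.2f}")
```

With these numbers, the first model still predicts "F" for the short type and does fine on the population, while the second has learned the wrong label for the most common type and its accuracy collapses.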

So when it comes down to it, testing your predictive ML model on representative data is where you find out whether you have a problem. However, I think it would be fairly rare to sample in such a biased way that prediction would suffer greatly. If accuracy / kappa statistic / AUC is low or RMSE is high when testing, then you might want to drop some of the respondents from over-represented demographics, provided you have enough data.
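Both ideas can be sketched in a few lines, under stated assumptions. `weighted_accuracy` is a hypothetical helper that uses survey weights to approximate testing on representative data when no representative test set exists. `downsample_to_shares` is a hypothetical helper that randomly drops respondents from over-represented groups until group shares match assumed population targets. The labels, weights, and shares below are made up, and every group present in the data is assumed to have a target share.

```python
import random
from collections import Counter, defaultdict

def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy where each test respondent counts in proportion to their survey weight."""
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / sum(weights)

def downsample_to_shares(rows, group_of, target_share, seed=0):
    """Randomly drop rows from over-represented groups so shares match target_share."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[group_of(row)].append(row)
    # Largest total size such that every group can still supply its target share.
    n_total = min(len(members) / target_share[g] for g, members in groups.items())
    kept = []
    for g, members in groups.items():
        k = min(len(members), round(n_total * target_share[g]))
        kept.extend(rng.sample(members, k))
    return kept

# Hypothetical test labels, predictions, and survey weights.
y_true = ["F", "F", "M", "M"]
y_pred = ["F", "M", "M", "M"]
survey_weights = [3.0, 1.0, 1.0, 1.0]

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
w_acc = weighted_accuracy(y_true, y_pred, survey_weights)

# Hypothetical training rows: group "A" is over-represented three to one.
rows = [("A", i) for i in range(60)] + [("B", i) for i in range(20)]
balanced = downsample_to_shares(rows, lambda r: r[0], {"A": 0.5, "B": 0.5})
balanced_counts = Counter(group for group, _ in balanced)

print(plain_acc, round(w_acc, 3), dict(balanced_counts))
```

Downsampling throws data away, so it only makes sense when the smallest group is still large enough to learn from. Many libraries also accept per-row weights during training, which is an alternative worth checking before discarding respondents.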
