Solved – Confounding variables in machine learning predictions

bias, confounding, descriptive statistics, machine learning, predictive-models

In classical statistics, the confounding variable is a critical concept, since confounding can distort our view of the relationship between the input variables and the outcome variable. Many forms of control and adjustment are used in statistics to eliminate, avoid, or minimize its effect. For example, expected confounding variables (e.g., age and sex) are often included in the analysis, so that in the final model the coefficient of the explanatory variable of interest (e.g., treatment) is adjusted for those confounders (age and sex).

Confounding is not a topic that frequently shows up in machine learning and predictive analysis. I wonder how confounding may (or may not) play an important role in machine learning algorithms. Does confounding potentially affect out-of-sample accuracy? Is including or excluding an expected confounding variable an important consideration when selecting features in machine learning?

Best Answer

Confounding plays a large role in statistics because we are looking to identify the exact effect of a set of variables on another variable. If confounding variables are left out of a statistical model, then the effect measured for the variables that were included may be biased.

Confounding is not as big a problem when performing prediction, because we are not concerned with identifying the exact effect of one variable on another. We are simply looking for the "most likely" value of a dependent variable given a set of predictors.

So, for example, suppose we would like to estimate to what degree a person's age affects their salary. We can estimate the model:
$$ \text{salary}_i = \beta_0 + \beta_1 \text{age}_i + \varepsilon_i. $$
It is very likely that $\beta_1$ in the equation above will be positive and fairly large, because older people tend to have more education and more work experience. So if we wish to pin-point the link between age and salary, we should probably control for these confounders and estimate the model:
$$ \text{salary}_i = \beta_0 + \beta_1^* \text{age}_i + \beta_2 \text{education}_i + \beta_3 \text{experience}_i + \varepsilon_i. $$
It is very likely that $\beta_1^* < \beta_1$ and that $\beta_1^*$ will be a much better estimator of the pure effect of age on one's earnings, in the sense of "change someone's age and keep EVERYTHING else fixed". However, since age is highly correlated with education and experience, the first model might still be good enough for predicting a person's salary.
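
To make this concrete, here is a minimal simulation sketch of the two models above. It is not from the original answer: the data are synthetic, and the coefficient values, noise scales, and variable names (age, education, experience, salary) are illustrative assumptions. It uses ordinary least squares from statsmodels to show that the age-only model gives an inflated age coefficient but a similar predictive fit.

```python
# Simulation sketch (assumed, illustrative values): education and experience
# both rise with age and also raise salary, so they confound age -> salary.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000

age = rng.uniform(20, 65, n)
education = 8 + 0.15 * age + rng.normal(0, 2, n)      # years of schooling
experience = 0.6 * (age - 20) + rng.normal(0, 3, n)   # years of work
# True data-generating process: the direct effect of age is only 200 per year.
salary = (10_000 + 200 * age + 2_000 * education
          + 1_000 * experience + rng.normal(0, 5_000, n))

# Model 1: salary ~ age (confounders omitted)
fit1 = sm.OLS(salary, sm.add_constant(age)).fit()

# Model 2: salary ~ age + education + experience (confounders controlled)
X2 = sm.add_constant(np.column_stack([age, education, experience]))
fit2 = sm.OLS(salary, X2).fit()

# beta_1 also picks up the education/experience paths and is far above 200;
# beta_1* is close to the true direct effect of 200.
print("beta_1  (age only):        ", round(fit1.params[1], 1))
print("beta_1* (with confounders):", round(fit2.params[1], 1))

# For prediction, the age-only model is not far behind the full model.
print("R^2 age only:        ", round(fit1.rsquared, 3))
print("R^2 with confounders:", round(fit2.rsquared, 3))
```

The exact numbers depend on the assumed coefficients, but the pattern is the point: omitting the confounders biases the age coefficient upward, while the simpler model's predictive fit stays reasonably close to that of the full model.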