Are independent variables necessarily “independent”, and how does this relate to what’s being predicted?

independence, inference, prediction, terminology

I'm fairly new to statistics. I'm not clear on the meaning of independent and dependent variables and the relationship to what's being predicted.

In my text, as an example there is a data set containing many instances of the following:

  • a person's salary

  • a person's age

  • the year they earned that salary

  • their education level

The book mentions trying to predict their salary from the other three variables. Does this mean the other three are the independent variables and salary is the dependent variable?

When this data is arranged in a spreadsheet, with rows being people and columns being variables, something interesting appears. There is symmetry between all the variables. None of them holds a special place in the spreadsheet; each simply has its own column.

Which then leads me to ask, could we pick another one, say age, and predict that from salary/year/education? Is age now the dependent variable?

In high school statistics I learned that independent variables have some degree of independence… say the weather is independent from what I have for dinner. There's not much effect that one has on the other.

But in statistics, can the independent variables be regarded as the "things we are using to make the prediction," while the dependent variable is the "thing being predicted?" Is there still a need for independent variables to really be independent in a real-world sense?

Best Answer

The questions "What do you want to predict?" and "What is the outcome or result here?" often have the same answer, but not always.

The term independent variable is widely considered overloaded in the statistical sciences. Numerous writers and researchers, over at least the last several decades, have suggested using other terms, although there is little consensus on which terms are best. Candidates include predictors, explanatory variables, controlling variables, regressors, covariates, and inputs, among others.

The term dependent variable is similarly often replaced with something more evocative. For some time response seemed to lead the field of alternatives, but outcome and output have been common in recent usage. I note without enthusiasm the existence of regressand.

DV and IV are common abbreviations in some fields, sometimes seeming to tag initiates engaged by mutual consent in regression rituals. An objection to DV is that Deo volente remains a standard expansion for many people. A bigger objection to IV is that it is already spoken for (by many economists in particular) as the abbreviation for instrumental variable.

Still, the old terms linger on, and my impression (no names here) is that they are still often recommended in textbooks which on other grounds I regard as poor or incompetent.

Terminology aside: There is no absolute implication that so-called independent variables in a regression are statistically independent of each other, and indeed that fact is one of several objections to the terminology.

There are even situations in which predictors that are highly correlated with each other are introduced deliberately. Fitting a quadratic in $X$ and $X^2$ is a case in point, as $X$ and $X^2$ are not mutually independent. It is, however, foolish to include two predictors that carry essentially the same message, such as Fahrenheit and Celsius temperatures. In practice, good software has traps to detect that situation and drop predictors as needed, but researchers still need to be careful and thoughtful about their choice of predictors. The ideal (easier to advise as a principle than to ensure in practice) is for each predictor to have a clear rationale, and to use no more predictors than the purpose needs and the size of the dataset can reasonably support.
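To make the contrast concrete, here is a minimal sketch in Python with NumPy. The data are synthetic and every variable name is made up for illustration; nothing here comes from the textbook dataset. It shows that $X$ and $X^2$ can be strongly correlated while still giving a usable design matrix, whereas Celsius and Fahrenheit together make the design matrix rank-deficient, which is the situation software typically traps.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; synthetic data only

# Hypothetical predictor: 200 values between 20 and 65 (think of ages).
x = rng.uniform(20, 65, size=200)

# X and X^2 are strongly correlated, yet both belong in a quadratic fit.
corr_x_x2 = np.corrcoef(x, x**2)[0, 1]

# Design matrix for a quadratic: intercept, X, X^2.
quad_design = np.column_stack([np.ones_like(x), x, x**2])
quad_rank = np.linalg.matrix_rank(quad_design)

# Celsius and Fahrenheit carry the same message: F = 32 + 1.8 * C,
# so the Fahrenheit column is a linear combination of the other two.
celsius = rng.uniform(-10, 35, size=200)
fahrenheit = 32 + 1.8 * celsius
temp_design = np.column_stack([np.ones_like(celsius), celsius, fahrenheit])
temp_rank = np.linalg.matrix_rank(temp_design)

print(f"corr(X, X^2) = {corr_x_x2:.3f}")
print(f"rank of quadratic design: {quad_rank} (3 columns)")
print(f"rank of temperature design: {temp_rank} (3 columns)")
```

The quadratic design keeps full column rank despite the high correlation, so all three coefficients can be estimated. The temperature design loses a rank, so one of the two temperature columns has to go.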

Your example is instructive. Usually salary depends on age, sometimes directly if an individual moves up a salary scale, but more often indirectly through salary being affected by promotion or moves to a different job and those being affected by greater experience, expertise, reputation, and so forth. Conversely, sometimes older people are less attractive to employ (e.g. sports people past their peak). But the crux is that a salary raise doesn’t affect age, whereas a change in age may affect salary (on average, which is what we care about here). Causal paths can exist in indirect ways.

All that said, in different problems age is unknown and the goal is to predict it. This is standard in archaeology, forensic sciences, and several Earth and environmental sciences.
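The symmetry you noticed in the spreadsheet can be sketched as follows, again in Python with NumPy and with synthetic data. The column names, the coding of education as 1 to 5, the coefficients, and the helper `fit_linear` are all assumptions made up for illustration. The arithmetic of least squares does not care which column is labelled the response; whether a given choice makes sense is a question about the problem, not about the spreadsheet.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed; synthetic data only
n = 500

# Made-up columns standing in for the textbook dataset.
age = rng.uniform(22, 65, size=n)
year = rng.integers(0, 7, size=n).astype(float)       # years since a hypothetical start year
education = rng.integers(1, 6, size=n).astype(float)  # hypothetical coding, 1..5
salary = 20 + 0.6 * age + 8 * education + rng.normal(0, 10, size=n)  # in thousands

def fit_linear(response, predictors):
    """Ordinary least squares with an intercept; returns (coefficients, R^2)."""
    design = np.column_stack([np.ones(len(response))] + list(predictors))
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    fitted = design @ coef
    ss_res = np.sum((response - fitted) ** 2)
    ss_tot = np.sum((response - response.mean()) ** 2)
    return coef, 1 - ss_res / ss_tot

# Salary as the response, the other three columns as predictors.
salary_coef, salary_r2 = fit_linear(salary, [age, year, education])

# Same table, roles swapped: age as the response.
age_coef, age_r2 = fit_linear(age, [salary, year, education])

print(f"R^2 predicting salary: {salary_r2:.2f}")
print(f"R^2 predicting age:    {age_r2:.2f}")
```

Both fits run without complaint. The second is a perfectly good predictive model when age is what is unknown, even though no one would read its coefficients as saying that salary causes age.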
