I am reviewing some studies on the relationship between cognition and driving. Some studies have controlled for age and some have not (age is correlated with cognition). Can anyone please provide good references on statistical control that I can use to explain the positives and negatives of controlling for age?
Solved – Pros and cons of controlling for age when examining effect of cognition on driving performance
age, controlling-for-a-variable
Related Solutions
There are many ways to control for variables.
The easiest, and the one you came up with, is to stratify your data so you have sub-groups with similar characteristics - there are then methods to pool those stratum-specific results together to get a single "answer" (see the sketch below). This works if you have a very small number of variables you want to control for, but as you've rightly discovered, it rapidly falls apart as you split your data into smaller and smaller chunks.
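Here is a minimal sketch of stratify-then-pool, with made-up data and an inverse-variance pooling rule (one simple choice among several); every name and number here is invented for illustration:

set.seed(1)                                         # arbitrary seed, for reproducibility
df <- data.frame(sex = rep(c("F", "M"), each = 50))
df$x <- rnorm(100) + ifelse(df$sex == "M", 0.5, 0)  # exposure differs by stratum
df$y <- 2 + 0.5*df$x + ifelse(df$sex == "M", 0.25, 0) + rnorm(100, sd = 0.1)
fits <- lapply(split(df, df$sex), function(d) summary(lm(y ~ x, data = d)))
est <- sapply(fits, function(f) coef(f)["x", "Estimate"])    # stratum-specific effects
se <- sapply(fits, function(f) coef(f)["x", "Std. Error"])
w <- 1/se^2
sum(w*est)/sum(w)                                   # pooled estimate, close to the true 0.5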
A more common approach is to include the variables you want to control for in a regression model. For example, if you have a regression model that can be conceptually described as:
BMI = Impatience + Race + Gender + Socioeconomic Status + IQ
The estimate you will get for Impatience will be the effect of Impatience within levels of the other covariates - regression allows you to essentially smooth over places where you don't have much data (the problem with the stratification approach), though this should be done with caution.
There are yet more sophisticated ways of controlling for other variables, but odds are when someone says "controlled for other variables", they mean they were included in a regression model.
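In R, the conceptual model above might be fit like this - purely illustrative, assuming a hypothetical data frame dat with these columns (Socioeconomic Status abbreviated to SES):

fit <- lm(BMI ~ Impatience + Race + Gender + SES + IQ, data = dat)
summary(fit)   # the Impatience row gives the effect adjusted for the other covariates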
Alright, you've asked for an example you can work on, to see how this goes. I'll walk you through it step by step. All you need is a copy of R installed.
First, we need some data. Cut and paste the following chunks of code into R. Keep in mind this is a contrived example I made up on the spot, but it shows the process.
covariate <- sample(0:1, 100, replace=TRUE)     # binary confounder (e.g., gender, smoking)
exposure <- runif(100,0,1)+(0.3*covariate)      # exposure depends partly on the covariate
outcome <- 2.0+(0.5*exposure)+(0.25*covariate)  # known truth: intercept 2.0, exposure 0.5, covariate 0.25; no noise term
That's your data. Note that we already know the relationship between the outcome, the exposure, and the covariate - that's the point of many simulation studies (of which this is an extremely basic example). You start with a structure you know, and you make sure your method can get you the right answer.
Now then, onto the regression model. Type the following:
lm(outcome~exposure)
Did you get an Intercept = 2.0 and an exposure = 0.6766? Or something close to it, given there will be some random variation in the data? Good - this answer is wrong. We know it's wrong. Why is it wrong? We have failed to control for a variable that affects both the outcome and the exposure. It's a binary variable; make it anything you please - gender, smoker/non-smoker, etc.
Now run this model:
lm(outcome~exposure+covariate)
This time you should get coefficients of Intercept = 2.00, exposure = 0.50 and covariate = 0.25 - exactly, because we built the outcome with no noise term. This, as we know, is the right answer. You've controlled for other variables.
Now, what happens when we don't know if we've taken care of all of the variables that we need to (we never really do)? This is called residual confounding, and it's a concern in most observational studies - we have controlled imperfectly, and our answer, while close to right, isn't exact (the sketch below makes this concrete). Does that help more?
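Here is a hedged extension of the toy simulation above: u is a new, unmeasured confounder of a fresh exposure/outcome pair, and a model that can only adjust for the measured covariate recovers a biased exposure estimate. This reuses covariate from the earlier chunk, so run it in the same session:

u <- rnorm(100)                                   # unmeasured confounder
exposure2 <- runif(100,0,1) + 0.3*covariate + 0.3*u
outcome2 <- 2.0 + 0.5*exposure2 + 0.25*covariate + 0.5*u
lm(outcome2 ~ exposure2 + covariate)              # exposure2 estimate is biased above the true 0.5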
You're talking about multicollinearity (in the model inputs, e.g., hand movements and time). Multicollinearity does not undermine the reliability of a model as a whole: we can still reliably interpret the coefficient and standard error on our treatment variable. Its downside is that we can no longer interpret the coefficients and standard errors on the highly correlated control variables. But if we are strict in conceiving of our regression model as a notional experiment, where we want to estimate the effect of one treatment (T) on one outcome (Y), and treat the other variables (X) in our model as controls (not as estimable quantities of causal interest), then regressing on highly correlated variables is fine - see the sketch below.
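A minimal sketch (all names and numbers invented): x1 and x2 are nearly identical controls, yet the treatment coefficient stays near its true value with a tight standard error, while the coefficients on x1 and x2 become unstable with inflated standard errors:

set.seed(42)                                  # arbitrary seed, for reproducibility
n <- 1000
treatment <- rnorm(n)
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)                # x2 is almost a copy of x1
y <- 1 + 0.5*treatment + x1 + rnorm(n)
summary(lm(y ~ treatment + x1 + x2))          # treatment is ~0.5; x1 and x2 have huge SEs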
Another fact that may be worth thinking about is that if two variables are perfectly multicollinear, then one will be dropped from any regression model that includes them both.
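Continuing the sketch above, R flags the dropped variable with an NA coefficient:

x3 <- 2*x1                                    # perfectly collinear with x1
lm(y ~ treatment + x1 + x3)                   # x3 is reported as NA, i.e., dropped from the fit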
For more, see http://en.wikipedia.org/wiki/Multicollinearity
Best Answer
There are several purposes for statistical control.
Given the research context, I imagine that estimating the causal effect would be the main interest.
Evaluating the context:
Age on cognition: It may help to first think about each of the variables and what empirical research says about the causal nature of the relationships. For example, research shows that cognition (or at least IQ) increases up to early adulthood (i.e., around 18 to 20) and then remains stable until late adulthood, when a progressive decline tends to set in (e.g., see the Seattle Longitudinal Study). This process is theorised to be driven by various processes of maturation and, ultimately, physical-mental decline. Between-subjects studies are also known to overestimate within-subjects causal effects of age on cognition.
Age on driving performance: Initially, driving performance is largely influenced by amount of driving experience. Thus, in early adulthood age is largely a proxy for experience, though presumably it also captures maturity and risk-taking factors. For example, someone who has never driven a car at 30 years of age will initially be a terrible driver, because everyone who has never driven a car is. In late adulthood, age is more likely to have a causal effect through mechanisms like slower reaction time, poorer sensory-motor performance, and so on. Thus, the relationship between age and driving performance is inverted-U shaped.
Controlling for age: Based on the above analysis, controlling for age has complex effects. First, if the theory above is correct, controlling for age should not be done in a linear way, although that is the typical approach (see the sketch below). Second, if age affects driving performance through cognition, then you don't want to adjust for age, because age is just a distal cause. However, if age affects driving performance through physical and psychomotor skill decline, then cognition may just be correlated with that other decline and may not actually be the causal mechanism. You would also need to think about how these studies have dealt with the huge effect of driving experience. Presumably that is a more immediate factor that would need to be controlled for.
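As an illustration only (driving, cognition, age, and experience are hypothetical columns in a data frame d), a non-linear age adjustment might look like:

fit_lin <- lm(driving ~ cognition + age + experience, data = d)            # typical linear control
fit_quad <- lm(driving ~ cognition + poly(age, 2) + experience, data = d)  # allows an inverted-U age effect
anova(fit_lin, fit_quad)          # nested-model test of whether the flexible adjustment helps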
With regard to specific advice:
References
In terms of a reference, I'm not exactly sure what would be best, and I'd be interested in seeing what others say. Perhaps you could look at the report of the APA Task Force on Statistical Inference. It has a lot of great advice; in particular, see the section on causality.