I am reviewing some studies on the relationship between cognition and driving. Some studies have controlled for age and some have not (age is correlated with cognition). Can anyone please provide good references on statistical control that I can use to explain the positives and negatives of controlling for age?
Solved – Pros and cons of controlling for age when examining effect of cognition on driving performance
age, controlling-for-a-variable
Related Solutions
There are many ways to control for variables.
The easiest, and the one you came up with, is to stratify your data so you have sub-groups with similar characteristics - there are then methods to pool those stratum-specific results together to get a single "answer" (see the sketch below). This works if you have a very small number of variables you want to control for, but as you've rightly discovered, it rapidly falls apart as you split your data into smaller and smaller chunks.
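Here is a minimal sketch of stratify-then-pool, with made-up data and an inverse-variance pooling rule (one simple choice among several); every name and number here is invented for illustration:

set.seed(1)                                         # arbitrary seed, for reproducibility
df <- data.frame(sex = rep(c("F", "M"), each = 50))
df$x <- rnorm(100) + ifelse(df$sex == "M", 0.5, 0)  # exposure differs by stratum
df$y <- 2 + 0.5*df$x + ifelse(df$sex == "M", 0.25, 0) + rnorm(100, sd = 0.1)
fits <- lapply(split(df, df$sex), function(d) summary(lm(y ~ x, data = d)))
est <- sapply(fits, function(f) coef(f)["x", "Estimate"])    # stratum-specific effects
se <- sapply(fits, function(f) coef(f)["x", "Std. Error"])
w <- 1/se^2
sum(w*est)/sum(w)                                   # pooled estimate, close to the true 0.5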
A more common approach is to include the variables you want to control for in a regression model. For example, if you have a regression model that can be conceptually described as:
BMI = Impatience + Race + Gender + Socioeconomic Status + IQ
The estimate you will get for Impatience will be the effect of Impatience within levels of the other covariates - regression allows you to essentially smooth over places where you don't have much data (the problem with the stratification approach), though this should be done with caution.
There are yet more sophisticated ways of controlling for other variables, but odds are when someone says "controlled for other variables", they mean they were included in a regression model.
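In R, the conceptual model above might be fit like this - purely illustrative, assuming a hypothetical data frame dat with these columns (Socioeconomic Status abbreviated to SES):

fit <- lm(BMI ~ Impatience + Race + Gender + SES + IQ, data = dat)
summary(fit)   # the Impatience row gives the effect adjusted for the other covariates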
Alright, you've asked for an example you can work on, to see how this goes. I'll walk you through it step by step. All you need is a copy of R installed.
First, we need some data. Cut and paste the following chunks of code into R. Keep in mind this is a contrived example I made up on the spot, but it shows the process.
covariate <- sample(0:1, 100, replace=TRUE)     # binary confounder (e.g., gender, smoking)
exposure <- runif(100,0,1)+(0.3*covariate)      # exposure depends partly on the covariate
outcome <- 2.0+(0.5*exposure)+(0.25*covariate)  # known truth: intercept 2.0, exposure 0.5, covariate 0.25; no noise term
That's your data. Note that we already know the relationship between the outcome, the exposure, and the covariate - that's the point of many simulation studies (of which this is an extremely basic example). You start with a structure you know, and you make sure your method can get you the right answer.
Now then, onto the regression model. Type the following:
lm(outcome~exposure)
Did you get an Intercept = 2.0 and an exposure = 0.6766? Or something close to it, given there will be some random variation in the data? Good - this answer is wrong. We know it's wrong. Why is it wrong? We have failed to control for a variable that affects both the outcome and the exposure. It's a binary variable; make it anything you please - gender, smoker/non-smoker, etc.
Now run this model:
lm(outcome~exposure+covariate)
This time you should get coefficients of Intercept = 2.00, exposure = 0.50 and covariate = 0.25 - exactly, because we built the outcome with no noise term. This, as we know, is the right answer. You've controlled for other variables.
Now, what happens when we don't know if we've taken care of all of the variables that we need to (we never really do)? This is called residual confounding, and it's a concern in most observational studies - we have controlled imperfectly, and our answer, while close to right, isn't exact (the sketch below makes this concrete). Does that help more?
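Here is a hedged extension of the toy simulation above: u is a new, unmeasured confounder of a fresh exposure/outcome pair, and a model that can only adjust for the measured covariate recovers a biased exposure estimate. This reuses covariate from the earlier chunk, so run it in the same session:

u <- rnorm(100)                                   # unmeasured confounder
exposure2 <- runif(100,0,1) + 0.3*covariate + 0.3*u
outcome2 <- 2.0 + 0.5*exposure2 + 0.25*covariate + 0.5*u
lm(outcome2 ~ exposure2 + covariate)              # exposure2 estimate is biased above the true 0.5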
You're talking about multicollinearity (in the model inputs, e.g., hand movements and time). Multicollinearity does not undermine the reliability of a model as a whole: we can still reliably interpret the coefficient and standard error on our treatment variable. Its downside is that we can no longer interpret the coefficients and standard errors on the highly correlated control variables. But if we are strict in conceiving of our regression model as a notional experiment, where we want to estimate the effect of one treatment (T) on one outcome (Y), and treat the other variables (X) in our model as controls (not as estimable quantities of causal interest), then regressing on highly correlated variables is fine - see the sketch below.
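A minimal sketch (all names and numbers invented): x1 and x2 are nearly identical controls, yet the treatment coefficient stays near its true value with a tight standard error, while the coefficients on x1 and x2 become unstable with inflated standard errors:

set.seed(42)                                  # arbitrary seed, for reproducibility
n <- 1000
treatment <- rnorm(n)
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)                # x2 is almost a copy of x1
y <- 1 + 0.5*treatment + x1 + rnorm(n)
summary(lm(y ~ treatment + x1 + x2))          # treatment is ~0.5; x1 and x2 have huge SEs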
Another fact that may be worth thinking about is that if two variables are perfectly multicollinear, then one will be dropped from any regression model that includes them both.
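Continuing the sketch above, R flags the dropped variable with an NA coefficient:

x3 <- 2*x1                                    # perfectly collinear with x1
lm(y ~ treatment + x1 + x3)                   # x3 is reported as NA, i.e., dropped from the fit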
For more, see http://en.wikipedia.org/wiki/Multicollinearity
Best Answer
There are several purposes for statistical control.
Given the research context, I imagine that estimating the causal effect would be the main interest.
Evaluating the context:
Age on cognition: It may help to first think about each of the variables and what empirical research says about the causal nature of the relationships. For example, research shows that cognition (or at least IQ) increases up to early adulthood (i.e., around 18 to 20) and then remains stable until late adulthood, when a progressive decline tends to set in (e.g., see the Seattle Longitudinal Study). This process is theorised to be driven by various processes of maturation and, ultimately, physical-mental decline. Between-subjects studies are also known to overestimate within-subjects causal effects of age on cognition.
Age on driving performance: Initially, driving performance is largely influenced by amount of driving experience. Thus, in early adulthood age is largely a proxy for experience, though presumably it also captures maturity and risk-taking factors. For example, someone who has never driven a car at 30 years of age will initially be a terrible driver, because everyone who has never driven a car is. In late adulthood, age is more likely to have a causal effect through mechanisms like slower reaction time, poorer sensory-motor performance, and so on. Thus, the relationship between age and driving performance is inverted-U shaped.
Controlling for age: Based on the above analysis, controlling for age has complex effects. First, if the theory above is correct, controlling for age should not be done in a linear way, although that is the typical approach (see the sketch below). Second, if age affects driving performance through cognition, then you don't want to adjust for age, because age is just a distal cause. However, if age affects driving performance through physical and psychomotor skill decline, then cognition may just be correlated with that other decline and may not actually be the causal mechanism. You would also need to think about how these studies have dealt with the huge effect of driving experience. Presumably that is a more immediate factor that would need to be controlled for.
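As an illustration only (driving, cognition, age, and experience are hypothetical columns in a data frame d), a non-linear age adjustment might look like:

fit_lin <- lm(driving ~ cognition + age + experience, data = d)            # typical linear control
fit_quad <- lm(driving ~ cognition + poly(age, 2) + experience, data = d)  # allows an inverted-U age effect
anova(fit_lin, fit_quad)          # nested-model test of whether the flexible adjustment helps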
With regard to specific advice:
References
In terms of a reference, I'm not exactly sure what would be best, and I'd be interested in seeing what others say. Perhaps you could look at the report of the APA Task Force on Statistical Inference. It has a lot of great advice; in particular, see the section on causality.