By "controlling for a variable in your study design", I assume you mean causing a variable to be constant across all study units or manipulating a variable so that the level of that variable is independently set for each study unit. That is, controlling for a variable in your study design means that you are conducting a true experiment. The benefit of this is that it can help with inferring causality.
In theory, controlling for a variable in your regression model can also help with inferring causality. However, this is only the case if you control for every variable that has a direct causal connection to the response. If you omit such a variable (perhaps you didn't know to include it), and it is correlated with any of the other variables, then your causal inferences will be biased and incorrect. In practice, we don't know all the relevant variables, so statistical control is a fairly dicey endeavor that relies on big assumptions you can't check.
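As a quick sketch of this point (simulated data, all names invented), the R code below shows how omitting a confounder that is correlated with the included predictor biases the estimated coefficient:

```r
set.seed(1)
n <- 1000
z <- rnorm(n)                   # unobserved confounder
x <- 0.8 * z + rnorm(n)         # predictor of interest, correlated with z
y <- 1 * x + 2 * z + rnorm(n)   # true effect of x on y is 1

coef(lm(y ~ x))        # z omitted: estimate of x is biased (about 2, not 1)
coef(lm(y ~ x + z))    # z controlled for: estimate of x is close to 1
```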
However, your question asks about "reducing error and yielding more precise predictions", not inferring causality. This is a different issue. If you were to make a given variable constant through your study design, all of the variability in the response due to that variable would be eliminated. On the other hand, if you simply control for a variable, you are estimating its effect which is subject to sampling error at a minimum. In other words, statistical control wouldn't be quite as good, in the long run, at reducing residual variance in your sample.
But if you are interested in reducing error and getting more precise predictions, presumably you primarily care about out-of-sample properties, not the precision within your sample. And therein lies the rub. When you control for a variable by manipulating it in some way (holding it constant, etc.), you create a situation that is more artificial than the original, natural observation. That is, experiments tend to have less external validity / generalizability than observational studies.
In case it's not clear, an example of a true experiment that holds something constant might be assessing a treatment in a mouse model using inbred mice that are all genetically identical. On the other hand, an example of controlling for a variable might be representing family history of disease by a dummy code and including that variable in a multiple regression model (cf., How exactly does one “control for other variables”?, and How can adding a 2nd IV make the 1st IV significant?).
You are right that categorical data can be encoded with dummy variables, but you only need $C-1$ dummy variables for $C$ levels. There are different methods to encode the levels with dummy variables, but the easiest to understand is "treatment coding". If you have, e.g., three income sources "capital", "labour", and "welfare", treatment coding uses two dummy variables as follows:
| level   | labour | welfare |
|---------|--------|---------|
| capital | 0      | 0       |
| labour  | 1      | 0       |
| welfare | 0      | 1       |
In a linear regression, the intercept then describes the effect of "capital", the intercept plus the coefficient of the dummy variable "labour" describes the effect of "labour", and the intercept plus the coefficient of the dummy variable "welfare" describes the effect of "welfare".
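Treatment coding is R's default for factors, so a small sketch (invented data) can show both the dummy matrix and how the coefficients combine:

```r
income <- factor(c("capital", "labour", "welfare",
                   "capital", "labour", "welfare"))
model.matrix(~ income)   # columns: (Intercept), incomelabour, incomewelfare

set.seed(2)
y <- rnorm(6)
coef(lm(y ~ income))
# fitted mean for "capital": (Intercept)
# fitted mean for "labour":  (Intercept) + incomelabour
# fitted mean for "welfare": (Intercept) + incomewelfare
```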
Education can be treated the same way, although it might also be considered an ordinal variable. Encoding it as a categorical variable allows for non-linear effects of this variable. In most situations this does no harm, unless you explicitly want to rule out such non-linear effects.
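As a hypothetical illustration (simulated data), compare a factor encoding, which fits a separate mean per level, with a numeric encoding, which forces a single linear slope:

```r
set.seed(3)
edu <- factor(sample(c("primary", "secondary", "tertiary"), 100, replace = TRUE),
              levels = c("primary", "secondary", "tertiary"))
y <- rnorm(100)

coef(lm(y ~ edu))              # one coefficient per level: non-linear steps allowed
coef(lm(y ~ as.numeric(edu)))  # single slope: forces equal, linear spacing
```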
If you also want to model different slopes per category with respect to other variables, you can use "interaction terms", which have their own syntax in statistical software. In R, `*` gives the interaction plus a level-dependent intercept, while `:` gives only level-dependent slopes; see the sketch below.
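A minimal sketch with simulated data (names `x`, `g`, and `d` invented for illustration):

```r
set.seed(4)
d <- data.frame(x = rnorm(100),
                g = factor(sample(c("a", "b"), 100, replace = TRUE)))
d$y <- rnorm(100)

coef(lm(y ~ x * g, data = d))  # x + g + x:g -> level-dependent intercepts and slopes
coef(lm(y ~ x : g, data = d))  # x:g only    -> level-dependent slopes, common intercept
```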
You're talking about multicollinearity (among the model inputs, e.g., hand movements and time). The problem does not impact the reliability of the model overall: we can still reliably interpret the coefficient and standard error on our treatment variable. The downside of multicollinearity is that we can no longer interpret the coefficients and standard errors on the highly correlated control variables. But if we are strict in conceiving of our regression model as a notional experiment, where we want to estimate the effect of one treatment (T) on one outcome (Y), and treat the other variables (X) in our model as controls (not as estimable quantities of causal interest), then regressing on highly correlated variables is fine.
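A small simulation (all names invented) can make this concrete: the treatment coefficient stays well estimated even when two controls are nearly collinear, while the controls' own standard errors blow up:

```r
set.seed(5)
n  <- 500
tr <- rnorm(n)                   # "treatment" variable of interest
x1 <- rnorm(n)                   # control
x2 <- x1 + rnorm(n, sd = 0.01)   # nearly identical control
y  <- 2 * tr + x1 + x2 + rnorm(n)

summary(lm(y ~ tr + x1 + x2))
# the coefficient and SE for tr are stable and interpretable;
# x1 and x2 get huge SEs because they are nearly collinear
```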
Another fact worth thinking about is that if two variables are perfectly multicollinear, one of them will be dropped from any regression model that includes them both.
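For example (toy data), R silently drops one of two perfectly collinear predictors and reports its coefficient as `NA`:

```r
set.seed(6)
x1 <- rnorm(50)
x2 <- 2 * x1             # perfectly collinear with x1
y  <- x1 + rnorm(50)

coef(lm(y ~ x1 + x2))    # x2 is dropped; its coefficient is reported as NA
```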
For more, see http://en.wikipedia.org/wiki/Multicollinearity