Solved – Linear regression with dependent predictor variables

linear model, multiple regression, regression

Short story: Why is it important that the predictor variables of a linear regression model are independent? If I am not interested in the coefficients themselves, but only in which predictor variable is the most significant, am I allowed to use dependent predictor variables?

Long story: We would like to analyze the "quality" of a coating by using a linear regression model. Our parameters are as follows:

  • $Y$: the response variable, which measures the "quality" of the layer; it's a special property of the coating that we are interested in.
  • $P_1,\ldots, P_n$: some independent predictor variables, e.g. coating temperature, material of the substrate, coating thickness, etc. These predictor variables are controllable during the coating process.
  • $Q_1, \ldots, Q_m$: some dependent predictor variables, e.g. the density of the coated layer, its hardness, the coating rate, etc. These predictor variables can't be directly controlled during the coating process. They depend on the variables $P_1,\ldots, P_n$ and maybe on some unknown variables $X_1, \ldots, X_k$. Therefore, they can be expressed as functions $Q_j = Q_j(P_1,\ldots, P_n, X_1, \ldots, X_k)$.

Usually, I would model the response variable with the independent predictor variables. I would use the linear regression model
$$Y \sim \sum_{i=1}^n P_i$$
where I could include some interaction terms $P_i \cdot P_j$ as well. Then I would use the optimal parameter set for a coating experiment and verify the quality.
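
For concreteness, here is a minimal sketch of how such a model could be fitted with the statsmodels formula API. The column names (P1, P2, P3, Y) and the simulated data are hypothetical stand-ins for the real coating measurements:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in data for the controllable coating parameters.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["P1", "P2", "P3"])
df["Y"] = 1.0 * df.P1 - 0.5 * df.P2 + rng.normal(scale=0.1, size=50)

# Main effects of the controllable parameters plus one example interaction.
fit = smf.ols("Y ~ P1 + P2 + P3 + P1:P2", data=df).fit()
print(fit.summary())  # per-predictor coefficients and p-values
```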

However, according to the literature, the most important parameter for the "quality" $Y$ is the dependent predictor $Q_1$. Unfortunately, we are quite sure that we are missing some independent predictor variables $X_1, \ldots, X_k$, because our model $Q_1(P_1,\ldots, P_n)$ deviates from the measured value $Q_1^{(\mathrm{measured})}$. Therefore, I would like to check that the literature is correct and that $Q_1$ is indeed the most significant predictor variable. In order to do so, I would like to estimate the most significant terms using the model
$$Y \sim \sum_{i=1}^n P_i + \sum_{j=1}^m Q_j$$
where I treat the dependent variables as if they were independent predictors. Furthermore, I would like to include some interaction terms ($P_i \cdot P_j$, $P_i \cdot Q_j$, and $Q_i \cdot Q_j$) again.
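
A sketch of this mixed model, again with simulated stand-in data: here $Q_1$ is deliberately constructed from the $P_i$, so the predictors are dependent by design. The condition number reported by statsmodels is one first hint of such dependence:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["P1", "P2", "P3"])
df["Q1"] = 0.8 * df.P1 + 0.4 * df.P2 + 0.1 * rng.normal(size=50)  # Q1 depends on the P's
df["Y"] = 2.0 * df.Q1 + rng.normal(scale=0.2, size=50)

# P's and Q1 side by side, plus one example interaction term.
fit = smf.ols("Y ~ P1 + P2 + P3 + Q1 + P1:Q1", data=df).fit()
print(fit.summary())          # the footer also reports the condition number
print(fit.condition_number)   # grows as the predictors become more dependent
```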

My colleague says we should avoid this kind of analysis. His key argument is that the independence of the predictors is an assumption of every linear regression model. I totally agree, but this is not a convincing argument; it's merely a statement. What would be a proper argument, and what is a proper way to proceed?

Note: We are using a stepwise linear model, where we include predictors one by one according to their significance. Therefore, …

  • we first include only the most significant predictor variable. It must be independent, because it's the only predictor variable in the model. Let's call this predictor $Q_1$ and let's assume that it's a function of three other predictors: $Q_1 = Q_1(P_1, P_2, P_3)$.
  • if we include a second predictor variable, there could be a dependence between the two predictors. E.g., let's say we include $P_2$. Now our linear regression model reads $Y = c_1 \cdot Q_1(P_1, P_2, P_3) + c_2 \cdot P_2$, and it could be that $P_2$ has the second-highest significance only because its coefficient compensates for the contribution of $Q_1$ (see the simulation sketch after this list).
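
This compensation effect is easy to reproduce in a small simulation (all names and numbers below are made up for illustration): the fitted coefficient of $Q_1$ shifts noticeably once the correlated $P_2$ enters the model.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
P1, P2, P3 = rng.normal(size=(3, n))
Q1 = 0.5 * P1 + 1.0 * P2 + 0.3 * P3 + 0.2 * rng.normal(size=n)  # Q1 depends on the P's
Y = 2.0 * Q1 - 1.5 * P2 + rng.normal(size=n)

m1 = sm.OLS(Y, sm.add_constant(Q1)).fit()                         # Q1 alone
m2 = sm.OLS(Y, sm.add_constant(np.column_stack([Q1, P2]))).fit()  # Q1 and P2
print(m1.params)  # the slope of Q1 absorbs part of P2's effect
print(m2.params)  # the slope of Q1 shifts once P2 "compensates"
```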

My intuition tells me that this is just like in linear algebra: the independent predictor variables represent an orthogonal basis, while a set of dependent predictor variables (e.g. $Q_1$ and $P_2$) represents a non-orthogonal basis. In principle, orthogonality is nice, because if I change one coefficient I don't have to compensate by changing another coefficient as well. However, in principle I am allowed to use a non-orthogonal basis. Is this wrong?
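
The analogy can be checked numerically with a toy example (made-up data, not the coating measurements): with a nearly orthogonal second column, the coefficient of the first predictor barely moves when the second one is added; with a correlated column, it shifts substantially.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
a = rng.normal(size=n)
b_orth = rng.normal(size=n)             # essentially uncorrelated with a
b_corr = a + 0.1 * rng.normal(size=n)   # strongly correlated with a
y = a + b_corr + rng.normal(size=n)

coef = lambda X: np.linalg.lstsq(X, y, rcond=None)[0]
print(coef(a[:, None]))                    # a alone: slope near 2
print(coef(np.column_stack([a, b_orth])))  # orthogonal column: slope of a unchanged
print(coef(np.column_stack([a, b_corr])))  # correlated column: slope of a drops to ~1
```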

PS: We already checked that the measurements of our parameters are fine. We do not have a measurement problem.

Best Answer

If the predictor variables in a linear regression are dependent (correlated), then the significance of the individual predictors is undermined, and not-so-important predictors might be included in your model.

Suppose you include two predictor variables, diet and stress, that are dependent on each other. Your model would be:

weight = diet + stress

The influence of stress on weight gain might actually be due to the amount of diet. So here, the significance of diet is undermined: you might pick up stress as a significant variable when it actually isn't.

You can read up on multicollinearity to learn more.
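
One standard diagnostic is the variance inflation factor (VIF); below is a sketch with simulated diet/stress data (the numbers are made up). A VIF well above roughly 5–10 flags a predictor as nearly redundant with the others:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
diet = rng.normal(size=100)
stress = 0.9 * diet + 0.3 * rng.normal(size=100)  # stress depends on diet

X = sm.add_constant(np.column_stack([diet, stress]))
for idx, name in [(1, "diet"), (2, "stress")]:
    print(name, variance_inflation_factor(X, idx))  # both around 10 here
```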
