Solved – Does the order of entering predictors in a multiple regression model change the standardised beta coefficients?

multicollinearity, multiple regression

I am reviewing a number of research papers regarding domestic space heating energy consumption which used multiple regression techniques to identify the main determinants of space heating in the diverse (heterogeneous) types of homes that exist in north-western Europe.

The predictors are categorised into three groupings: building physics inputs (e.g. total floor area, age of building, insulation levels, whether double glazed or not, etc.), socio-demographic inputs (age of occupants, household income, employment status) and heating behaviour inputs (proportion of rooms heated, heating hours per day, regular or irregular heating patterns).

In all research papers, all of the three categories' inputs were "lumped in" together (excuse my ignorance) to produce coefficients for all of the predictors, and a coefficient of multiple determination (plus an adjusted R-squared to account for sample size and the number of predictors).
After admitting all of the predictors into the model, each researcher proceeded to analyse each grouping separately, but always in the same order: first by introducing the building physics predictors, then by adding the socio-demographic predictors, and finally by adding behavioural predictors, with the hindsight that the building physics predictors are always the most dominant.

My question is: would it change the significance or the relative importance (beta values in most, but not all, papers) of the predictors if, say, the behavioural factors were introduced first into the regression model, followed by the socio-demographic factors and then by the building physics factors?

Each of the three groupings has around 10 predictors.

Collinearity becomes a problem when each further grouping is introduced (such as between higher income and larger house size), which sometimes results in the exclusion of (what I think is) a highly significant determinant (such as income, which is "over-shadowed" by house size). Actually, only one paper (published in 2015) excluded "income" using a Lasso regression technique.

I am aware of high VIFs (>5 or >10, depending on the source) being used to identify collinearity, which would encourage exclusion of a predictor.
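For reference, the usual definition of the VIF can be written out so that the two thresholds are easier to interpret. Here $R_j^2$ is notation introduced for this note (not taken from the papers under review): the R-squared obtained by regressing predictor $j$ on all the other predictors.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j > 5 \iff R_j^2 > 0.8,
\qquad
\mathrm{VIF}_j > 10 \iff R_j^2 > 0.9
```

So a cut-off of 5 flags a predictor once the other predictors explain more than 80% of its variance, and a cut-off of 10 once they explain more than 90%.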

To reiterate: does the order in which predictors are introduced into a (any) multiple regression technique matter?

Best Answer

I think the answer may be fairly simple. Say you have 10 physical variables and 10 demographic variables, and you can include all 20 in your model without running into any multicollinearity or statistical significance issues (all variables are statistically significant). In that situation, the order of your variables makes no difference, since you are able to include them all. However, such a situation may be highly unlikely.
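A minimal sketch of that first case, on simulated data rather than anything from the papers: the variable names (`floor_area`, `income`, `heating_hours`, `energy`), the effect sizes and the helper `ols` are all hypothetical. It fits the same full model with the predictors listed in two different orders and compares the coefficients. The standardised betas are just the raw coefficients rescaled by each variable's standard deviation, so if the raw coefficients match, the betas do too.

```python
import random

def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for k in range(p):
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    return [A[r][p] / A[r][r] for r in range(p)]

# Simulated (hypothetical) data: income is correlated with floor area.
rng = random.Random(0)
n = 200
floor_area = [rng.gauss(0, 1) for _ in range(n)]
income = [0.7 * a + rng.gauss(0, 1) for a in floor_area]
heating_hours = [rng.gauss(0, 1) for _ in range(n)]
energy = [2.0 * a + 0.5 * inc + 1.0 * h + rng.gauss(0, 1)
          for a, inc, h in zip(floor_area, income, heating_hours)]

# Same full model, predictors listed in opposite orders.
b_fwd = ols([floor_area, income, heating_hours], energy)  # [const, area, income, hours]
b_rev = ols([heating_hours, income, floor_area], energy)  # [const, hours, income, area]

print("floor area:", b_fwd[1], b_rev[3])
print("income:    ", b_fwd[2], b_rev[2])
print("hours:     ", b_fwd[3], b_rev[1])
```

Each pair printed should agree up to floating-point rounding: reordering the columns only permutes the coefficient vector.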

You are more likely to run into issues of statistical significance and multicollinearity. Those issues will force you to remove, or not select, some of the variables of either type. In that situation, the order will have a material impact not only on which variables are selected for the model, but also on both their regression coefficients and their standardized coefficients. In other words, the order affects everything the minute you deal with a model that does not include all the variables, or compare similar models that do not have an identical variable selection. But if your model includes all 20 just fine, whether you start selecting them from 1 to 20 or from 20 to 1 makes no difference.
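To illustrate the second case, here is a rough sketch with just two correlated predictors, again on simulated data with hypothetical names and effect sizes. It uses the standard closed-form results for two-predictor regression, written in terms of pairwise correlations, to show two things: the R-squared credited to `income` depends on whether it is entered before or after `floor_area`, and the standardised beta of `floor_area` changes depending on whether `income` is in the model.

```python
import random
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / sqrt(saa * sbb)

# Simulated (hypothetical) data: larger homes tend to go with higher income.
rng = random.Random(1)
n = 500
floor_area = [rng.gauss(0, 1) for _ in range(n)]
income = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in floor_area]
energy = [1.0 * a + 0.5 * i + rng.gauss(0, 1)
          for a, i in zip(floor_area, income)]

r_ya = pearson(energy, floor_area)
r_yi = pearson(energy, income)
r_ai = pearson(floor_area, income)

# R^2 of the model containing both predictors (two-predictor closed form).
r2_full = (r_ya**2 + r_yi**2 - 2 * r_ya * r_yi * r_ai) / (1 - r_ai**2)

# R^2 credited to income when it is entered first vs. after floor area.
income_first = r_yi**2
income_second = r2_full - r_ya**2

# Standardised beta of floor area on its own vs. alongside income.
beta_area_alone = r_ya
beta_area_full = (r_ya - r_yi * r_ai) / (1 - r_ai**2)

print("R^2 credited to income, entered first: ", income_first)
print("R^2 credited to income, entered second:", income_second)
print("beta(floor area), income excluded:", beta_area_alone)
print("beta(floor area), income included:", beta_area_full)
```

In this setup income looks far more important when entered first than when it is added after floor area, even though the final two-predictor model is the same either way. That is the sense in which a predictor can appear "over-shadowed" by a correlated one that was entered earlier.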
