Solved – Principal Components for dependent variable in a regression


My question is related to PCA. I want to estimate the effect of agricultural variability on welfare measures for financially included people versus financially excluded people. For this purpose I am trying to build a welfare index through PCA from four sets of household expenses: food consumption, non-food consumption, household expenditure and expenditure on durable goods.

  1. Can PCA be used to create an index which can then be used as a dependent variable (Household welfare index through PCA in the present case)?

  2. If yes, how should the betas on the independent variables in the regression be interpreted? Take, for example, the beta on the age of the household head.

Best Answer

I think that, mathematically, what you are attempting is feasible (question 1). As for interpreting the betas (question 2), PCA is extremely challenging to interpret even in the best of cases. Usually, most of the explanatory power is concentrated in the first Principal Component. When you study the loadings (the weights on each variable) of that first Principal Component, you often observe results that are counterintuitive and cryptic. Extending the interpretation to the second and third Components is most often just as baffling.
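To make both questions concrete, here is a minimal sketch that builds an index from the first principal component of four expense columns and then regresses that index on the household head's age. The data are synthetic and every name (`age`, `expenses`, `welfare_index`) is hypothetical; it uses plain NumPy and is one possible way to set this up, not the asker's actual data or method.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data (assumption): one latent "welfare" level that rises with age
# and drives all four expense categories, plus category-specific noise.
age = rng.uniform(20, 70, size=n)
latent = 0.05 * age + rng.normal(size=n)
# Columns: food, non-food, household expenditure, durable goods.
expenses = np.column_stack([latent + 0.5 * rng.normal(size=n) for _ in range(4)])

# Standardize each column so no category dominates because of its scale.
z = (expenses - expenses.mean(axis=0)) / expenses.std(axis=0)

# PCA via eigendecomposition of the correlation matrix.
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
loadings = eigvecs[:, -1]                 # weights of the first component
# The sign of a principal component is arbitrary; orient it so that
# higher expenses mean a higher index.
if loadings.sum() < 0:
    loadings = -loadings

welfare_index = z @ loadings              # first-PC score = the index
explained = eigvals[-1] / eigvals.sum()   # share of variance captured by PC1

# OLS of the index on age (with an intercept).
X = np.column_stack([np.ones(n), age])
beta, *_ = np.linalg.lstsq(X, welfare_index, rcond=None)

print("loadings:", np.round(loadings, 3))
print("variance share of PC1:", round(explained, 3))
print("beta on age:", round(beta[1], 4))
```

Under these assumptions, the beta on age is the expected change in the index, in principal-component score units, for one additional year of age. Those units have no natural meaning, so one option is to rescale the index to unit variance and read the beta in standard deviations of the index. The sign also depends on how the component was oriented, which is why the sketch fixes it explicitly.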

PCA is best used for two purposes: 1) condensing a large number of independent variables into a few Principal Components; and 2) resolving multicollinearity among a very large number of independent variables. However, PCA has a major drawback: it is opaque. Clear interpretation of the results ranges from very challenging to impossible.
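The multicollinearity point can be illustrated with a small sketch: principal component scores are mutually uncorrelated by construction, even when the original variables are almost copies of one another. The data below are made up, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Synthetic, strongly collinear regressors (assumption): three noisy copies
# of the same underlying signal.
signal = rng.normal(size=n)
X = np.column_stack([signal + 0.1 * rng.normal(size=n) for _ in range(3)])

# Center and scale, then take the SVD; rows of vt are the component directions.
z = (X - X.mean(axis=0)) / X.std(axis=0)
u, s, vt = np.linalg.svd(z, full_matrices=False)
scores = z @ vt.T                     # principal component scores

corr_original = np.corrcoef(X, rowvar=False)
corr_scores = np.corrcoef(scores, rowvar=False)

print("original correlations:\n", np.round(corr_original, 2))
print("score correlations:\n", np.round(corr_scores, 2))
```

The off-diagonal correlations vanish for the scores, but each score is now a blend of all the original variables, which is exactly where the loss of interpretability comes from.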

An alternative to PCA is simply to streamline your model. Use a clearly defined and straightforward dependent variable. Explore tens of independent variables if you wish, but judiciously select the 6 or 7 that make the most sense in light of social science and economic theory. Usually, this generates a simple, explanatory model that is easy to interpret and present to various audiences. You can add more variables, but watch out for overfitting. Overfitting is a mathematical trap that PCA models can readily fall into: they fit the noise in the data and end up being very poor predictors.
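The overfitting warning can be seen in a toy comparison of a small model against one that throws in every candidate variable. This is only a sketch on synthetic data: the sample sizes, the number of variables, and the helper names (`make_data`, `fit_ols`, `r_squared`) are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, p = 60, 1000, 50

def make_data(n):
    # Synthetic data (assumption): only the first 3 of the 50 candidate
    # variables truly affect the outcome.
    X = rng.normal(size=(n, p))
    y = X[:, :3].sum(axis=1) + rng.normal(size=n)
    return X, y

def fit_ols(X, y):
    # Ordinary least squares with an intercept column.
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def r_squared(coef, X, y):
    pred = np.column_stack([np.ones(len(y)), X]) @ coef
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

results = {}
for label, k in [("3 variables", 3), ("50 variables", p)]:
    coef = fit_ols(X_tr[:, :k], y_tr)
    results[label] = (r_squared(coef, X_tr[:, :k], y_tr),
                      r_squared(coef, X_te[:, :k], y_te))
    print(label, "train R2 = %.2f, test R2 = %.2f" % results[label])
```

The larger model always fits the training sample at least as well, because it contains the smaller one, yet it does worse on fresh data since much of what it fitted was noise.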
