How to choose the 10 variables that explain the most variation in a wealth index

Tags: factor analysis, model selection, pca, stata

I have household survey data with 32 questions about the assets a household owns. I assume that, taken together, the answers to these asset questions (e.g. how many televisions does the household own) indicate wealth and could be used to build a good wealth index, e.g. by taking the first component of a principal components analysis.

What I want to do, however, is to choose 10 of these variables that jointly explain the largest possible proportion of the variation in wealth and use those as the questions in a shorter questionnaire that I am developing. What is the best way of doing this?

One possibility that has occurred to me is to calculate the wealth index using PCA, then regress it on every possible combination of 10 variables from the 32 (about 64.5 million regressions) and see which gets the highest R-squared. I'm hoping there's an easier way.
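The size of that exhaustive search is easy to check. The snippet below is a small Python sketch (not Stata) that counts the 10-variable subsets of 32 items, which comes to 64,512,240, and converts the count into a rough running time. The 1 ms per regression figure is purely an assumption for illustration, not a measurement.

```python
import math

n_items, n_keep = 32, 10

# Number of distinct 10-variable subsets of the 32 asset questions.
n_subsets = math.comb(n_items, n_keep)
print(f"{n_subsets:,} candidate subsets")

# Assumed cost per regression; replace with a timing from your own machine.
seconds_per_fit = 0.001
hours = n_subsets * seconds_per_fit / 3600
print(f"roughly {hours:.1f} hours at {seconds_per_fit * 1000:.0f} ms per fit")
```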

Ideally I'm looking to implement this in Stata.

Best Answer

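I think you need to define more precisely what you are looking for. You could have 10 variables that each individually account for 90% of the variance, but if they all account for the same 90%, then keeping all ten tells you little more than keeping one, which may not be what you want. Penalized regression can help here: an L1 penalty (lasso) shrinks some coefficients exactly to zero, so it selects variables or groups of variables that predict your wealth index well, while an L2 penalty (ridge) shrinks coefficients without dropping any, so on its own it stabilizes the fit but does not select. There are also other techniques, such as Minimum Redundancy Maximum Relevance (mRMR), that select features that are strong predictors while avoiding features that duplicate one another.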
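As a rough illustration of the L1 idea, here is a minimal Python sketch (not Stata) of lasso regression by coordinate descent. It relaxes the penalty until at least k coefficients are nonzero and keeps the k largest. Everything here is an assumption made for the example: the function names (`standardize`, `soft_threshold`, `lasso`, `pick_k`) are hypothetical, the penalty grid is arbitrary, and the data are synthetic, with `index` standing in for a precomputed wealth index such as the first principal component score. It uses only the standard library so that it runs as is; for real work a tested lasso implementation would be the safer choice.

```python
import random

def standardize(col):
    """Center a column and scale it to unit variance."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / sd for v in col]

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso(cols, y, lam, max_sweeps=200, tol=1e-8):
    """Coordinate-descent lasso on standardized columns and target.

    Minimizes (1 / (2n)) * sum(residual**2) + lam * sum(abs(beta_j)).
    """
    n = len(y)
    beta = [0.0] * len(cols)
    resid = list(y)  # residual equals y while all coefficients are zero
    for _ in range(max_sweeps):
        biggest_change = 0.0
        for j, xj in enumerate(cols):
            # Covariance of column j with the residual that ignores column j.
            rho = sum(a * r for a, r in zip(xj, resid)) / n + beta[j]
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * a for a, r in zip(xj, resid)]
                beta[j] = new
                biggest_change = max(biggest_change, abs(delta))
        if biggest_change < tol:
            break
    return beta

def pick_k(cols, y, k):
    """Relax the penalty until at least k coefficients are nonzero, then keep
    the k largest in absolute value (fewer if k are never reached)."""
    for step in range(19, 0, -1):  # lam = 0.95, 0.90, ..., 0.05
        beta = lasso(cols, y, step / 20)
        active = [j for j, b in enumerate(beta) if b != 0.0]
        if len(active) >= k:
            break
    ranked = sorted(active, key=lambda j: abs(beta[j]), reverse=True)
    return sorted(ranked[:k])

# Synthetic stand-in data: 300 households, 8 items (not real survey data).
random.seed(0)
n_households, n_items = 300, 8
raw = [[random.gauss(0, 1) for _ in range(n_households)] for _ in range(n_items)]
# Hypothetical index driven by items 0, 2 and 5 only, plus noise.
index = [3 * raw[0][i] + 2 * raw[2][i] + 2 * raw[5][i] + random.gauss(0, 0.5)
         for i in range(n_households)]

cols = [standardize(c) for c in raw]
chosen = pick_k(cols, standardize(index), k=3)
print("selected items:", chosen)
```

With the real survey, `cols` would hold the 32 standardized asset variables, `index` the PCA score, and `k` would be 10. Because highly correlated items compete for the same coefficient, the lasso tends to keep one of them and drop the rest, which addresses the "same 90%" problem. In Stata, recent versions (16 and later) include a built-in `lasso linear` command that fits this kind of model.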
