Solved – PCA with all categorical factors prior to a regression with a continuous response

anova, pca, r, regression

I realize that similar questions have already been asked and answered, but I am in need of a bit more detail and specific advice as I am new to PCA and statistical methods in general. My question is also a bit broader because I will be putting it in context and I need to know if I'm even headed in the right direction.

I have a great deal of data. For each datapoint, there is one continuous response variable that I am interested in examining. Let's call it X.

There are also five or six categorical variables, most of which have between three and ten possible values. One of them, however (let's call it A), has 154 possible values, and to complicate things further, each datapoint can fall into one to four of those 154 categories. The vast majority of datapoints take just one of the 154 values, but about 10% take two or three values, and maybe 0.5% take four values. (I am actually considering including a discrete but quantitative variable equal to the number of values taken by A, as I think it might also be a relevant factor affecting X.)
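For concreteness, here is a minimal Python sketch of how a multi-valued variable like A might be indicator-coded ("multi-hot"), with the count of values appended as an extra column. The record layout, the level names such as `a17`, and the `multi_hot` helper are all hypothetical and only illustrate the data shape described above; they are not taken from the real dataset.

```python
# Hypothetical records: each datapoint lists the 1-4 levels of A it falls into,
# together with its continuous response X. Names and values are made up.
records = [
    {"A": ["a17"], "X": 2.3},
    {"A": ["a03", "a17"], "X": 4.1},
    {"A": ["a03", "a88", "a120"], "X": 5.0},
]


def multi_hot(records, key="A"):
    """Return (levels, rows).

    Each row holds one 0/1 indicator per observed level of `key`,
    followed by a final column counting how many levels the datapoint takes.
    """
    # Collect every level that appears anywhere, in a stable (sorted) order.
    levels = sorted({lvl for rec in records for lvl in rec[key]})
    index = {lvl: j for j, lvl in enumerate(levels)}

    rows = []
    for rec in records:
        row = [0] * len(levels)
        for lvl in set(rec[key]):  # set() guards against repeated levels
            row[index[lvl]] = 1
        row.append(sum(row))  # number of distinct levels taken
        rows.append(row)
    return levels, rows


levels, rows = multi_hot(records)
for rec, row in zip(records, rows):
    print(rec["A"], "->", row)
```

With 154 levels this would give 154 indicator columns plus the count column, so the coded matrix is wide and mostly zeros.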

My ultimate goal here is twofold: to create a predictive model with multiple regression, and to use ANOVA to determine how much of the variance in X each of my variables explains.

Someone more familiar with statistics than I suggested that I start with PCA because both multiple regression and ANOVA assume that all factors are independent. I'm pretty sure there are some correlations between a few of my factors (though I have no idea what they are) so I figured PCA would be a good way to begin disentangling.

My questions are:

  1. Can I perform PCA given the categorical nature of my data? If so, what method should I use to "dummy code" the variables? If not, what method would be more effective?

  2. Will including a single discrete, quantitative variable (the number of values taken by A) complicate matters?

  3. Will PCA even do what I want? (namely, disentangling the variables so I can then use multiple regression and ANOVA)

  4. Whatever you recommend, is it possible in R, and if so, how? (I haven't even downloaded R yet but it's been recommended to me and it's free so I'm inclined to give it a swing. I have some programming experience in Python and C++ so in theory I can learn it without too much difficulty.)

Thanks very much in advance.

Best Answer

Neither regression nor ANOVA assumes independence of the factors. If the correlation is severe, however, you might consider looking at multiple correspondence analysis: http://en.wikipedia.org/wiki/Multiple_correspondence_analysis

Here is a guide with some of the R packages referenced: http://factominer.free.fr/classical-methods/multiple-correspondence-analysis.html
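To get a rough sense of whether the association between two categorical factors is severe, one common option is Cramér's V, which is computed from the chi-squared statistic of their contingency table. This measure is not part of the answer above; it is only a sketch of one way to check, written in plain Python with hypothetical factor names and made-up data.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equally long sequences of category labels.

    Returns a value in [0, 1]: 0 means no association in the sample,
    1 means perfect association.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    joint = Counter(zip(xs, ys))

    # V is undefined for empty data or a factor with a single level;
    # returning 0.0 in that case is a convention chosen for this sketch.
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k <= 0:
        return 0.0

    # Pearson chi-squared statistic over every cell of the contingency table.
    chi2 = 0.0
    for r, nr in row_totals.items():
        for c, nc in col_totals.items():
            expected = nr * nc / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    return sqrt(chi2 / (n * k))


# Hypothetical factors: factor_c mirrors factor_b, factor_d is unrelated to it.
factor_b = ["low", "low", "high", "high"]
factor_c = ["u", "u", "v", "v"]
factor_d = ["u", "v", "u", "v"]

print(cramers_v(factor_b, factor_c))
print(cramers_v(factor_b, factor_d))
```

Values near 1 for a pair of factors would suggest they carry largely the same information, which is the situation where a method such as multiple correspondence analysis may be worth a look.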