Solved – Reference on factor analysis with categorical variables

factor-analysis, ordinal-data

I have been reading a great deal about factor analysis with categorical variables, and I am getting frustrated working through so many PDFs. I have 40 variables obtained from 40 questions, all of them categorical; I can treat them as ordinal. The questions are different, so the answer scales are different too. It is not a Likert-type setup where, for example, 1 means good for every question, 2 means moderate, 3 means bad, and so on. For example, one question is:

"How often do senior management visit the wards to talk to staff?"

rarely or never ..................... 1    
around once a year................... 2
around once a month.................. 3
around once a week................... 4

For another question:

"What is the average amount of training (per person) received by a management staff?"

Less than a day ..................... 1
Less than a week .................... 2
One to two weeks .................... 3

Etc.

I want to factor-analyze these variables. As you can see, the codes 1, 2, 3, etc. mean different things for different questions, and the number of categories also differs from question to question.

Another problem is that I have non-response (missing values) in the data.

Here are my questions:

  1. What is the most appropriate method of factor analysis for this kind of data? Could you also point me to one good reference for the methodology you suggest? I would be grateful for your help.
  2. If possible, please also give me some indication of what to do with the missing values.
  3. I need to calculate factor scores from the analysis. I have tried polychoric correlations, but I cannot obtain factor scores from them. The factor scores are very important: I cannot do any further analysis without them.

@this.is.not.a.nick: Thank you so much for your kind advice. CATPCA was suggested to me as well, but if the polychoric-correlation approach can also give me factor scores, that would be great. Andrea, chl and ttnphns, could you please confirm whether fa.poly() or fa.parallel.poly() use the principal component solution for the loadings and the specific variances, or whether they estimate these parameters by maximum likelihood like factanal() does? Since I cannot assume the data to be normally distributed, I think the principal component method of estimation would be preferable in this case.

If these functions don't use the principal component solution, then I think I can do the following:

  1. Calculate the polychoric correlation matrix r using package psych.

  2. Compute loadings

    library(psych)
    f <- principal(r, nfactors = 3, rotate = "varimax")  # say, 3 factors retained; r is the polychoric correlation matrix
    l <- f$loadings[1:ncol(data), ]                      # loading matrix as a plain matrix ('data' is the original data frame)
    
  3. Compute scores

    h  <- t(l) %*% l                       # L'L
    s  <- solve(h) %*% t(l)                # (L'L)^(-1) L', as in fhat_i = (L'L)^(-1) L' z_i
    z1 <- scale(data)[1, , drop = FALSE]   # standardized responses of observation 1
    f1 <- s %*% t(z1)                      # factor scores for observation 1
    f1
    

In this way I can obtain $\hat{f}_i$, $i=1,\dots,n$, manually from the least-squares factor score formula for the principal component solution (reference: Applied Multivariate Statistical Analysis by Johnson and Wichern).
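For all observations at once, I suppose the same formula can be applied in matrix form (just a sketch, assuming data has no missing values and l is the loading matrix from step 2):

    Z    <- scale(data)                    # n x p matrix of standardized responses
    fhat <- Z %*% l %*% solve(t(l) %*% l)  # n x 3 matrix of scores; row i is fhat_i'
    head(fhat)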

Now, for this procedure to work there should not be any missing values (NA). So, if I recode the missing values as 0, meaning "no comment" for each question, is there any problem? I think 0 would then simply act as another category of my categorical variables. Since the variables are categorical, I don't think I should impute the mean or the median.
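In R that recoding would just be something like this (a sketch on the original data frame data):

    data_nc <- data                 # copy of the data, with non-response as its own category
    data_nc[is.na(data_nc)] <- 0    # 0 = "no comment"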

Am I right in my thinking? Please advise. If possible, please send me the paper on polychoric versus Pearson correlations. You have helped me a lot, and I look forward to contributing myself once I have learned more.

Best Answer

I think the best method, in your case, is to factor-analyze the polychoric correlation matrix. In R, the 'psych' package allows you to perform a polychoric factor analysis (via the fa.poly function) and also to compute the factor scores. The documentation and this web page may be useful.
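Something along these lines should work (only a sketch: I assume your items are in a data frame called df, coded as integers, and the number of factors and the rotation are just examples; if I remember the object structure correctly, the loadings are in fit$fa$loadings and the scores in fit$scores):

    library(psych)
    fit <- fa.poly(df, nfactors = 3, rotate = "varimax", fm = "wls", scores = "regression")
    print(fit$fa$loadings, cutoff = 0.3)   # loadings from the polychoric factor analysis
    head(fit$scores)                       # factor scores (regression method)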

Moreover, the 'psych' package contains the fa.parallel.poly function, which is very useful for choosing the number of factors to retain via parallel analysis (Monte Carlo simulation).
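For example (same hypothetical df as above; the number of simulations is arbitrary, and I think newer versions of 'psych' accept fa.parallel(df, cor = "poly") as well):

    fa.parallel.poly(df, n.iter = 20, fm = "minres")   # observed vs. simulated eigenvalues from polychoric correlations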

As for the missing values, you can either exclude them from the analysis or replace them with the mean or the median values.
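If you prefer imputation, I believe the missing and impute arguments of fa.poly handle it when the factor scores are computed (again just a sketch):

    fit <- fa.poly(df, nfactors = 3, missing = TRUE, impute = "median")   # imputes the median for missing responses when scoring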

Here is a recent paper that confirms the superiority of the polychoric factor analysis:

Holgado-Tello, F. P., Chacón-Moscoso, S., Barbero-García, I., & Vila-Abad, E. (2010). Polychoric versus Pearson correlations in exploratory and confirmatory factor analysis of ordinal variables. Quality & Quantity, 44(1), 153-166.

In response to your second question: principal component analysis and factor analysis are not the same thing. If your aim is simply to reduce the data, then principal component analysis is the technique of choice. If instead you want to explore the underlying dimensions of your questionnaire, you have to use factor analysis. In PCA the components are derived from the variables (by maximizing explained variance), while in FA it is the factors that explain the variables, so the direction of the model is reversed. To my knowledge, this is the main consideration when choosing between the two methods.

fa.poly conducts an FA, and you can specify the factoring method (GLS, WLS, principal factors, ...). If you want to conduct a PCA, I think you can use principal, but feed it the polychoric correlation matrix rather than the raw data. Check the 'psych' documentation for these details; I have never done a categorical principal component analysis myself.
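Something like this should do it (untested sketch, with the same hypothetical df):

    rho <- polychoric(df)$rho                                # polychoric correlation matrix
    pc  <- principal(rho, nfactors = 3, rotate = "varimax")  # component solution on the polychoric correlations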