In my experiment I have used 30 different accessions of a species. One group is challenged with drought and the other group is the control. I have collected data on 6 different parameters. I want to know which accessions are more tolerant or susceptible, which accession is more affected by which variable (parameter), which variable is most critical, and so on. Can I use principal component analysis? Should I combine the data from both groups for the PCA, or subtract the treatment data from the control data? Or how else can I do a PCA with data from both groups?
Solved – Principal component analysis with group data
ecology, group-differences, pca
Related Solutions
I think you are very confused about what principal component analysis is, and the Chang paper has added to your confusion. Start with multivariate data in, say, k dimensions. The principal components are a particular transformation of the coordinates such that the first principal component is the direction along which the data show the largest variance. The second principal component is the direction orthogonal to the first that shows the largest remaining variance. The third is orthogonal to the first two and shows the most remaining variance (not captured by PC 1 or PC 2). The remaining principal components are defined analogously. In this way a new k-dimensional set of orthogonal basis vectors is constructed.
The purpose is dimensionality reduction. What is commonly called principal component analysis is the determination of the principal components of the multivariate data set, followed by projecting the data onto the first few principal components to get a good lower-dimensional visualization of the shape of the data. The number of principal components retained is based on the share of the total variance in the data that those components explain.
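As a rough illustration of choosing the number of components from the variance explained, here is a minimal sketch in base R using prcomp(). The toy data, the variable names and the 80% threshold are all assumptions made for the example, not a recommendation.

```r
# Toy data (hypothetical): 100 observations of 5 variables that share
# one latent factor, so they are correlated to different degrees.
set.seed(42)
n <- 100
z <- rnorm(n)
X <- sapply(1:5, function(j) z * (6 - j) / 5 + rnorm(n, sd = 0.5))
colnames(X) <- paste0("v", 1:5)

# PCA on centred and scaled variables.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each component, and the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)

# Keep the smallest number of components that reaches the (arbitrary) 80% threshold.
n_keep <- which(cum_explained >= 0.80)[1]

# The projected data: scores on the retained components only.
scores <- pca$x[, seq_len(n_keep), drop = FALSE]
print(round(cum_explained, 3))
```

The threshold is a judgment call; a scree plot of var_explained is a common alternative way to decide.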
Now, people think of applying PCA in clustering because they expect the clusters to be visible in the two-dimensional planes determined by these components. In his paper, Chang suggests that the better thing to do in cluster analysis is to look at the last principal components, which exhibit the least amount of variation. I guess that in the example he looks at it is easier to see spatial separation because the data in each cluster are more tightly packed. I am not sure that this can be made a general principle. Chang demonstrates his claim on a special example. The data are generated from a mixture of two k-dimensional normal distributions with different means but identical covariance matrices. This makes the data appear to have 2 clusters. Apparently, if you simulate data from these mixture distributions, the cluster separation is best seen in the last few PCs.
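To see how this can happen, here is a small constructed toy case in base R. It is not Chang's own example: the two dimensions, the standard deviations and the mean shift are assumptions chosen so that the clusters differ only along the low-variance direction.

```r
set.seed(1)
n <- 200  # observations per cluster

# Both clusters share the same covariance: large spread on x1, small on x2.
# They differ only in the mean of x2 (assumed shift of 3 units).
make_cluster <- function(mean_x2) {
  cbind(x1 = rnorm(n, mean = 0, sd = 5),
        x2 = rnorm(n, mean = mean_x2, sd = 0.5))
}
X <- rbind(make_cluster(-1.5), make_cluster(1.5))
cluster <- rep(c("a", "b"), each = n)

# Unscaled PCA, so the high-variance direction x1 dominates PC1.
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Absolute difference between the two cluster means on each component.
sep <- abs(colMeans(pca$x[cluster == "a", ]) -
           colMeans(pca$x[cluster == "b", ]))
print(round(sep, 2))
```

In this setup the cluster means are much further apart on PC2 than on PC1, even though PC1 carries most of the variance. Whether that carries over to real data depends on where the group differences lie relative to the within-group spread.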
Hopefully I have cleared things up about PCA regarding this question and your other one. Now, regarding your question about the large p, small N problem: principal component analysis will still have all the same mathematical properties as for any other values of p and N. Regarding cluster analysis when p is large and N is small, any method of clustering will have difficulties in such a setting. Clustering will be difficult with PCA whether you apply it the conventional way or the way Chang suggests. But I see nothing special about PCA that would make it have more difficulties than other methods.
Since you know the group memberships, you can use a supervised approach rather than doing unsupervised clustering. Note that gene-expression data can often be modeled best in log-expression scale, which would be equivalent to $-\Delta C_t$ in your PCR data.
One approach that could accomplish both goals is to use the group membership as the outcome values in a multinomial model and the gene-expression values as ridge-regression predictors. Ridge regression is related to the principal-components regression used for dimension reduction, but with principal components weighted continuously rather than all-or-none.
Multinomial ridge regression is implemented, for example, by the glmnet() function in the eponymous R package, when called with parameter settings of family = "multinomial" and alpha = 0. You use cross-validation (with cv.glmnet()) to find a penalty value for the ridge regression that lowers the magnitudes of regression coefficients to minimize the chance of over-fitting the data.
Overall fit of the model will tell you whether "expression of this panel of genes [can] distinguish patients from different groups." Coefficients of individual genes will tell you which genes tend to distinguish the groups.
With so few cases in each group it's unlikely that you will have very robust or reliable results, however.
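The multinomial ridge approach described above might look roughly like the following sketch. It assumes the glmnet package is installed. The data are simulated, and the group labels, gene names and effect sizes are hypothetical.

```r
library(glmnet)

# Simulated stand-in for log-scale expression data (hypothetical):
# 3 groups of 20 cases, 10 genes, two of which differ between groups.
set.seed(123)
n_per_group <- 20
p <- 10
group <- factor(rep(c("A", "B", "C"), each = n_per_group))
x <- matrix(rnorm(length(group) * p), ncol = p,
            dimnames = list(NULL, paste0("gene", 1:p)))
x[group == "B", "gene1"] <- x[group == "B", "gene1"] + 2
x[group == "C", "gene2"] <- x[group == "C", "gene2"] + 2

# alpha = 0 selects the ridge penalty; cv.glmnet() picks lambda by cross-validation.
cv_fit <- cv.glmnet(x, group, family = "multinomial", alpha = 0)

# One coefficient vector per group, evaluated at the cross-validated penalty.
coefs <- coef(cv_fit, s = "lambda.min")

# In-sample class predictions (optimistic; shown only to illustrate the call).
pred <- predict(cv_fit, newx = x, s = "lambda.min", type = "class")
accuracy <- mean(as.vector(pred) == as.character(group))
print(accuracy)
```

The in-sample accuracy overstates performance; the cross-validated deviance stored in cv_fit$cvm is the more honest summary of overall fit.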
Best Answer
In general, I don't see why you couldn't do a PCA to visualize and interpret your multivariate dataset (however, since you didn't provide data, I cannot say for sure). As for your second question, I would keep the two groups (drought, control) and not subtract them from each other. That way you will be able to see whether the component scores (illustrated as points in the plot) cluster and how the component loadings (illustrated as vectors in the plot) relate to them.

Here is an example to illustrate what I mean (it also addresses your third question):
Generate an example dataset (based on your description):
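The answer's original code is not reproduced here. As a stand-in, here is a minimal sketch in base R that simulates data of the described shape: 30 accessions, two treatments and 6 parameters. The names acc01 to acc30 and param1 to param6, the means and the drought shifts are all assumptions for illustration.

```r
set.seed(2024)
n_acc <- 30
accession <- factor(sprintf("acc%02d", seq_len(n_acc)))
params <- paste0("param", 1:6)

# Control measurements: one row per accession, one column per parameter.
control <- matrix(rnorm(n_acc * 6, mean = 10, sd = 1), ncol = 6,
                  dimnames = list(NULL, params))

# Drought measurements: control values plus an assumed parameter-specific
# shift and some extra noise.
shift <- c(-3, -2, -1, 0, 1, 2)
drought <- control + matrix(shift, n_acc, 6, byrow = TRUE) +
  matrix(rnorm(n_acc * 6, sd = 0.5), ncol = 6)

# Long format: both treatments stacked, with a treatment column.
dat <- rbind(
  data.frame(accession, treatment = "control", control),
  data.frame(accession, treatment = "drought", drought)
)
dat$treatment <- factor(dat$treatment)
str(dat)
```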
The following steps can be achieved in a lot of different ways (perhaps also better and more efficiently) with different R packages. But here is what usually works for me:

PCA using FactoMineR:
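The original FactoMineR call is missing from this copy of the answer. Here is a sketch of what such a step could look like, assuming FactoMineR is installed and using simulated data with hypothetical names. The treatment column is passed as a supplementary qualitative variable so that it does not enter the PCA itself.

```r
library(FactoMineR)

# Simulated data (hypothetical): 30 accessions x 2 treatments x 6 parameters.
set.seed(2024)
n_acc <- 30
shift <- c(-3, -2, -1, 0, 1, 2)  # assumed drought effect per parameter
control <- matrix(rnorm(n_acc * 6, mean = 10), ncol = 6,
                  dimnames = list(NULL, paste0("param", 1:6)))
drought <- control + matrix(shift, n_acc, 6, byrow = TRUE) +
  rnorm(n_acc * 6, sd = 0.5)
dat <- data.frame(
  treatment = factor(rep(c("control", "drought"), each = n_acc)),
  rbind(control, drought)
)

# PCA on the six standardized parameters; column 1 (treatment) is supplementary.
res <- PCA(dat, quali.sup = 1, scale.unit = TRUE, graph = FALSE)

res$eig              # eigenvalues and percentage of variance per component
head(res$ind$coord)  # component scores, one row per observation
res$var$coord        # variable coordinates (correlations with the components)
```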
Build the plot with the ggplot2 package:

The vectors represent component loadings, which are the correlations of the principal components with the original variables. The strength of the correlation is indicated by the vector length, and the direction indicates which accessions have high values for the original variables.
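A plot of the kind described, with scores as points and loadings as vectors, could be built along the following lines. This is a sketch and not the answer's original plotting code. It assumes FactoMineR and ggplot2 are installed and uses simulated data with hypothetical names. The arrow scaling factor k is an arbitrary display choice.

```r
library(FactoMineR)
library(ggplot2)

# Simulated data (hypothetical): 30 accessions x 2 treatments x 6 parameters.
set.seed(2024)
n_acc <- 30
control <- matrix(rnorm(n_acc * 6, mean = 10), ncol = 6,
                  dimnames = list(NULL, paste0("param", 1:6)))
drought <- control + matrix(c(-3, -2, -1, 0, 1, 2), n_acc, 6, byrow = TRUE) +
  rnorm(n_acc * 6, sd = 0.5)
dat <- data.frame(
  treatment = factor(rep(c("control", "drought"), each = n_acc)),
  rbind(control, drought)
)
res <- PCA(dat, quali.sup = 1, graph = FALSE)

# Scores (points) and loadings (vectors) on the first two components.
scores <- data.frame(res$ind$coord[, 1:2], treatment = dat$treatment)
loads <- data.frame(res$var$coord[, 1:2], variable = rownames(res$var$coord))
k <- 3  # stretch the loading vectors so they are visible next to the scores

p <- ggplot() +
  geom_point(data = scores, aes(Dim.1, Dim.2, colour = treatment)) +
  geom_segment(data = loads,
               aes(x = 0, y = 0, xend = Dim.1 * k, yend = Dim.2 * k),
               arrow = arrow(length = unit(0.2, "cm"))) +
  geom_text(data = loads,
            aes(Dim.1 * k, Dim.2 * k, label = variable), vjust = -0.5) +
  labs(x = sprintf("PC1 (%.1f%%)", res$eig[1, 2]),
       y = sprintf("PC2 (%.1f%%)", res$eig[2, 2]))
print(p)
```

Colouring the points by treatment is what makes it possible to see whether drought and control observations separate along the first two components.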
Also I would suggest having a look here for more information on how to interpret PCAs in general (if that's needed at all).
Also, since you have predetermined groups, i.e. drought and control, you might also have a look at linear discriminant analysis (LDA). Both PCA and LDA are rotation-based techniques. While PCA tries to maximize the total variance explained in the dataset, LDA maximizes the separation between groups (it discriminates between them). For more information you could have a look at the candisc function in the candisc package, or the lda() function in the MASS package, for example (both in R).
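For comparison with the PCA, here is a minimal LDA sketch using lda() from MASS, which ships with standard R installations. The data are simulated and the names are hypothetical. Note that this simple call treats the 60 observations as independent and ignores the pairing by accession.

```r
library(MASS)

# Simulated data (hypothetical): 30 accessions x 2 treatments x 6 parameters.
set.seed(2024)
n_acc <- 30
control <- matrix(rnorm(n_acc * 6, mean = 10), ncol = 6,
                  dimnames = list(NULL, paste0("param", 1:6)))
drought <- control + matrix(c(-3, -2, -1, 0, 1, 2), n_acc, 6, byrow = TRUE) +
  rnorm(n_acc * 6, sd = 0.5)
dat <- data.frame(
  treatment = factor(rep(c("control", "drought"), each = n_acc)),
  rbind(control, drought)
)

# With two groups there is a single discriminant axis (LD1).
fit <- lda(treatment ~ ., data = dat)
fit$scaling  # weights of the six parameters on LD1

# Scores on LD1 and in-sample classification.
pred <- predict(fit)
ld_scores <- pred$x[, "LD1"]
print(table(observed = dat$treatment, predicted = pred$class))
```

The parameters with the largest absolute weights in fit$scaling contribute most to separating drought from control, which speaks to the question of which variable is most critical.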