In genome-wide association studies, what are principal components?

genetics, gwas, pca

In genome-wide association studies (GWAS):

  1. What are the principal components?
  2. Why are they used?
  3. How are they calculated?
  4. Can a genome-wide association study be done without using PCA?

Best Answer

In this particular context, PCA is mainly used to account for population-specific variation in the distribution of alleles at the SNPs (or other DNA markers, although I'm only familiar with the SNP case) under investigation. Such "population substructure" mainly arises because minor allele frequencies differ between genetically distant ancestries (e.g. Japanese, Black African, or European American). The general idea is well explained in Population Structure and Eigenanalysis, by Patterson et al. (PLoS Genetics 2006, 2(12)), or in the Lancet's special issue on genetic epidemiology (2005, 366; most articles can be found on the web, start with Cordell & Clayton, Genetic Association Studies).

The construction of principal axes follows the classical approach to PCA, applied to the scaled matrix (individuals by SNPs) of observed genotypes (AA, AB, BB; say B is the minor allele in all cases), with the exception that an additional normalization to account for genetic drift might be applied. It all assumes that the number of copies of the minor allele (taking values in {0,1,2}) can be treated as numeric, that is, we work under an additive model (also called allelic dosage) or any equivalent one that would make sense. As the successive orthogonal PCs account for the maximum remaining variance, this provides a way to highlight groups of individuals that differ in minor allele frequency. The software used for this is known as Eigenstrat. The method is also available through the egscore() function in the GenABEL R package (see also GenABEL.org). It is worth noting that other methods for detecting population substructure have been proposed, in particular model-based cluster reconstruction (see references at the end). More information can be found by browsing the HapMap project and the tutorials available from the Bioconductor project (search for Vince J. Carey's or David Clayton's nice tutorials on Google).
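As a rough illustration of the construction described above, here is a minimal NumPy sketch. It is not Eigenstrat's or GenABEL's code: the function name `genotype_pcs`, the simulated allele frequencies, and the sample sizes are all assumptions made for the example. Each SNP column is centred on its mean and divided by $\sqrt{p(1-p)}$, which is my reading of the normalization proposed by Patterson et al., with $p$ estimated as half the column mean.

```python
import numpy as np

def genotype_pcs(genotypes, n_pcs=2):
    """PCA of an (individuals x SNPs) matrix of minor-allele counts in {0, 1, 2}."""
    G = np.asarray(genotypes, dtype=float)
    p = G.mean(axis=0) / 2.0                 # estimated minor allele frequency per SNP
    polymorphic = (p > 0) & (p < 1)          # monomorphic SNPs would divide by zero
    G, p = G[:, polymorphic], p[polymorphic]
    # Centre each SNP on its mean (2p) and scale by sqrt(p(1-p)).
    Z = (G - 2.0 * p) / np.sqrt(p * (1.0 - p))
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_pcs] * S[:n_pcs]        # coordinates of individuals on the PCs
    explained = (S ** 2) / np.sum(S ** 2)    # share of total variance per axis
    return scores, explained[:n_pcs]

# Toy data (assumption): two populations whose allele frequencies have drifted apart.
rng = np.random.default_rng(0)
n_per_pop, n_snps = 50, 500
freq_a = rng.uniform(0.1, 0.5, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, n_snps), 0.05, 0.95)
G = np.vstack([
    rng.binomial(2, freq_a, (n_per_pop, n_snps)),   # population A
    rng.binomial(2, freq_b, (n_per_pop, n_snps)),   # population B
])

scores, explained = genotype_pcs(G, n_pcs=2)
mean_a = scores[:n_per_pop, 0].mean()   # average PC1 score in population A
mean_b = scores[n_per_pop:, 0].mean()   # average PC1 score in population B
```

In this simulation the two groups should sit at opposite ends of the first axis, which is the kind of separation a PC1-versus-PC2 scatter plot shows on real data.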

Apart from clustering subpopulations, this approach can also be used for detecting outliers, which might arise in two cases (AFAIK): (a) genotyping errors, and (b) when working with a homogeneous population (or one assumed to be so, given self-reported ethnicity), individuals exhibiting unexpected genotypes. What is usually done in this case is to apply PCA iteratively and remove individuals whose scores fall outside $\pm 6$ SD on at least one of the first 20 principal axes; this amounts to "whitening" the sample, in some sense. Note that any such measure of genotype distance (this also holds when using Multidimensional Scaling in place of PCA) will also make it possible to spot relatives or siblings. The plink software provides additional methods; see the section on Population stratification in the on-line help.
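The iterative outlier filter can be sketched as follows. This is only an illustration of the idea, not the procedure of any particular package: the helper names, the cap of 5 iterations, and the simulated "outlier" (one individual drawn from flipped allele frequencies) are assumptions, and for brevity the genotypes are only column-centred rather than fully normalized.

```python
import numpy as np

def top_pc_scores(genotypes, n_pcs):
    """Scores of individuals on the first n_pcs axes of the column-centred matrix."""
    Z = genotypes - genotypes.mean(axis=0)
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_pcs] * S[:n_pcs]

def remove_pc_outliers(genotypes, n_pcs=20, n_sd=6.0, max_iter=5):
    """Return a boolean mask of the individuals kept after iterative filtering."""
    G = np.asarray(genotypes, dtype=float)
    keep = np.ones(G.shape[0], dtype=bool)
    for _ in range(max_iter):
        scores = top_pc_scores(G[keep], n_pcs)        # PCA on the current sample
        z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
        flagged = (np.abs(z) > n_sd).any(axis=1)      # beyond n_sd on any axis
        if not flagged.any():
            break                                     # nothing left to remove
        kept_idx = np.flatnonzero(keep)
        keep[kept_idx[flagged]] = False               # drop and recompute the PCs
    return keep

# Toy data (assumption): 200 homogeneous individuals plus one aberrant sample.
rng = np.random.default_rng(0)
n_ind, n_snps = 200, 1000
freq = rng.uniform(0.05, 0.3, n_snps)
G = rng.binomial(2, freq, (n_ind, n_snps)).astype(float)
G[0] = rng.binomial(2, 1.0 - freq, n_snps)   # individual 0 has flipped frequencies

keep = remove_pc_outliers(G)
n_removed = int((~keep).sum())
```

Recomputing the PCs after each removal matters because a single extreme individual can dominate an axis and hide milder outliers behind it.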

Considering that eigenanalysis uncovers some structure at the level of individuals, we can use this information when trying to explain observed variation in a given phenotype (or any distribution that might be defined according to a binary criterion, e.g. disease or case-control status). Specifically, we can adjust our analysis for those PCs (i.e., the factor scores of individuals), as illustrated in Principal components analysis corrects for stratification in genome-wide association studies, by Price et al. (Nature Genetics 2006, 38(8)), and later work (there was a nice picture showing axes of genetic variation in Europe in Genes mirror geography within Europe; Nature 2008; Fig 1A reproduced below). Note also that another solution is to carry out a stratified analysis (by including ethnicity in a GLM); this is readily available in the snpMatrix package, for example.

[Figure: Fig. 1A from Genes mirror geography within Europe (Nature 2008), showing axes of genetic variation in Europe]
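To make the adjustment for PCs concrete, here is a small simulated sketch. It uses PCs as covariates in an ordinary least-squares regression, which is one common way to apply the idea and not the exact Eigenstrat statistic of Price et al. Everything in it is assumed for illustration: the function `snp_effect`, the allele frequencies, and a phenotype that differs between two populations for non-genetic reasons, so that no SNP has a true effect.

```python
import numpy as np

def snp_effect(y, snp, covariates=None):
    """OLS estimate and t-statistic for the SNP coefficient, optionally adjusted."""
    n = len(y)
    cols = [np.ones(n), snp]
    if covariates is not None:
        cols.extend(covariates.T)                 # one column per covariate (PC)
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])     # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)         # covariance of the estimates
    return beta[1], beta[1] / np.sqrt(cov[1, 1])

# Toy data (assumption): two populations with drifted allele frequencies.
rng = np.random.default_rng(1)
n_per_pop, n_snps = 200, 1000
freq_a = rng.uniform(0.1, 0.5, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, n_snps), 0.05, 0.95)
freq_a[0], freq_b[0] = 0.1, 0.6                   # SNP 0 is strongly differentiated
G = np.vstack([
    rng.binomial(2, freq_a, (n_per_pop, n_snps)),
    rng.binomial(2, freq_b, (n_per_pop, n_snps)),
]).astype(float)
pop = np.repeat([0.0, 1.0], n_per_pop)
y = 1.0 * pop + rng.normal(0.0, 1.0, 2 * n_per_pop)   # ancestry effect only

# First two PCs of the column-centred genotype matrix (no extra scaling here).
Z = G - G.mean(axis=0)
U, S, _ = np.linalg.svd(Z, full_matrices=False)
pcs = U[:, :2] * S[:2]

beta_raw, t_raw = snp_effect(y, G[:, 0])                  # confounded by ancestry
beta_adj, t_adj = snp_effect(y, G[:, 0], covariates=pcs)  # adjusted for PCs
```

In this setup the unadjusted test picks up the ancestry difference as if it were a SNP effect, while the PC-adjusted test statistic should be much closer to zero.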

References

  1. Daniel Falush, Matthew Stephens, and Jonathan K Pritchard (2003). Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics, 164(4): 1567–1587.
  2. B Devlin and K Roeder (1999). Genomic control for association studies. Biometrics, 55(4): 997–1004.
  3. JK Pritchard, M Stephens, and P Donnelly (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2): 945–959.
  4. Gang Zheng, Boris Freidlin, Zhaohai Li, and Joseph L Gastwirth (2005). Genomic control for association studies under various genetic models. Biometrics, 61(1): 186–92.
  5. Chao Tian, Peter K. Gregersen, and Michael F. Seldin (2008). Accounting for ancestry: population substructure and genome-wide association studies. Human Molecular Genetics, 17(R2): R143–R150.
  6. Kai Yu, Population Substructure and Control Selection in Genome-wide Association Studies.
  7. Alkes L. Price, Noah A. Zaitlen, David Reich, and Nick Patterson (2010). New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics.
  8. Chao Tian, et al. (2009). European population genetic substructure: further definition of ancestry informative markers for distinguishing among diverse European ethnic groups. Molecular Medicine, 15(11-12): 371–383.