Under the null hypothesis, your p-values should follow a uniform distribution (I'm ignoring some issues with dependencies here, as they are not really relevant to the discussion). This includes the false positives: just by coincidence, you should get a few extreme p-values, but since these are exactly what the expected distribution predicts, they will not distort the QQ-plot. False positives are already accounted for in this type of QQ-plot.
On the other hand, if some SNPs were tested where the null hypothesis is not true, typically this should show up as a lower p-value than expected, and these will indeed distort the image.
What the author of the article you link seems to mean is: if you have a very strong distortion, there are unexpectedly many extreme p-values (relative to the uniform distribution), so something is likely to be wrong.
So, in short: a QQ-plot of p-values against the uniform distribution should show relatively little deviation from the straight line, and only at the extremely low p-values (showing that you have more low p-values than you would if the null hypothesis were true everywhere).
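The point about the uniform reference distribution can be illustrated with a small simulation. This is only a sketch on made-up data: the sample size, the 50 "true associations" and the helper name qq_points are assumptions for illustration, not taken from the linked article.

```python
import numpy as np

def qq_points(pvalues):
    """Return (expected, observed) -log10 p-values for a uniform QQ-plot."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = p.size
    # Approximate expected order statistics of n Uniform(0, 1) draws
    expected = (np.arange(1, n + 1) - 0.5) / n
    return -np.log10(expected), -np.log10(p)

rng = np.random.default_rng(0)

# Case 1: the null hypothesis holds for every test (simulated)
null_p = rng.uniform(size=10_000)

# Case 2: same tests, but 50 hypothetical SNPs with very small p-values
mixed_p = null_p.copy()
mixed_p[:50] = rng.uniform(0, 1e-6, size=50)

exp_null, obs_null = qq_points(null_p)
exp_mixed, obs_mixed = qq_points(mixed_p)
```

Plotting obs_null against exp_null should hug the diagonal, chance extremes included, while obs_mixed should lift off the diagonal only in the upper tail, where the extra low p-values sit.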
Best Answer
In this particular context, PCA is mainly used to account for population-specific variations in allele distribution at the SNPs (or other DNA markers, although I'm only familiar with the SNP case) under investigation. Such "population substructure" mainly arises as a consequence of varying frequencies of minor alleles in genetically distant ancestries (e.g. Japanese and Black African or European American). The general idea is well explained in Population Structure and Eigenanalysis, by Patterson et al. (PLoS Genetics 2006, 2(12)), or the Lancet's special issue on genetic epidemiology (2005, 366; most articles can be found on the web, start with Cordell & Clayton, Genetic Association Studies).
The construction of principal axes follows from the classical approach to PCA, which is applied to the scaled matrix (individuals by SNPs) of observed genotypes (AA, AB, BB; say B is the minor allele in all cases), except that an additional normalization to account for genetic drift might be applied. It all assumes that the number of copies of the minor allele (taking values in {0,1,2}) can be considered as numeric, that is, we work under an additive model (also called allelic dosage) or any equivalent one that would make sense. As the successive orthogonal PCs will account for the maximum variance, this provides a way to highlight groups of individuals differing at the level of minor allele frequency. The software used for this is known as Eigenstrat. It is also available in the egscore() function from the GenABEL R package (see also GenABEL.org). It is worth noting that other methods to detect population substructure were proposed, in particular model-based cluster reconstruction (see references at the end). More information can be found by browsing the HapMap project and the tutorials available from the Bioconductor project. (Search for Vince J Carey or David Clayton's nice tutorials on Google.)
Apart from clustering subpopulations, this approach can also be used for detecting outliers, which might arise in two cases (AFAIK): (a) genotyping errors, and (b) when working with a homogeneous population (or assumed so, given self-reported ethnicity), individuals exhibiting an unexpected genotype. What is usually done in this case is to apply PCA in an iterative manner, and remove individuals whose scores fall outside $\pm 6$ SD on at least one of the first 20 principal axes; this amounts to "whitening" the sample, in some sense. Note that any such measure of genotype distance (this also holds when using Multidimensional Scaling in place of PCA) will also make it possible to spot relatives or siblings. The plink software provides additional methods; see the section on Population stratification in the on-line help.
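To make the two steps concrete, here is a rough NumPy sketch of the genotype PCA and of the iterative outlier removal. It is not the Eigenstrat or egscore() code: the function names, the simulated allele frequencies, the use of 10 axes and the "all BB" mis-genotyped individual are assumptions for illustration, and the scaling by sqrt(p(1-p)) is one common form of the drift-motivated normalization.

```python
import numpy as np

def genotype_pcs(G, n_pcs=10):
    """PCA scores from an individuals-by-SNPs matrix of minor-allele counts."""
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0                 # estimated allele frequency per SNP
    keep = (p > 0) & (p < 1)                 # drop monomorphic SNPs
    G, p = G[:, keep], p[keep]
    # Centre each SNP, then scale by sqrt(p(1-p)) (drift-motivated normalization)
    X = (G - 2.0 * p) / np.sqrt(p * (1.0 - p))
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_pcs] * s[:n_pcs]          # individual scores on the top axes

def remove_outliers(G, n_pcs=10, n_sd=6.0, max_iter=5):
    """Iteratively drop individuals beyond n_sd SD on any of the top axes."""
    G = np.asarray(G, dtype=float)
    kept = np.arange(G.shape[0])
    for _ in range(max_iter):
        scores = genotype_pcs(G[kept], n_pcs)
        z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
        outlier = (np.abs(z) > n_sd).any(axis=1)
        if not outlier.any():
            break
        kept = kept[~outlier]
    return kept                              # row indices of retained individuals

rng = np.random.default_rng(1)
n_snps = 1000
freq_a = rng.uniform(0.1, 0.5, n_snps)       # hypothetical frequencies, ancestry A
freq_b = rng.uniform(0.1, 0.5, n_snps)       # independent frequencies, ancestry B

# Two ancestries mixed in one sample: the first axis should separate them
pop_a = rng.binomial(2, freq_a, size=(100, n_snps))
pop_b = rng.binomial(2, freq_b, size=(100, n_snps))
pc1 = genotype_pcs(np.vstack([pop_a, pop_b]), n_pcs=2)[:, 0]

# Homogeneous sample plus one grossly mis-genotyped individual (BB everywhere)
homog = rng.binomial(2, freq_a, size=(200, n_snps))
bad = np.full((1, n_snps), 2)
kept = remove_outliers(np.vstack([homog, bad]))
```

In this toy setting the simulated groups differ strongly and the outlier is extreme, so both effects are easy to see; real data are usually far less clear-cut.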
Considering that eigenanalysis allows us to uncover some structure at the level of the individuals, we can use this information when trying to explain observed variations in a given phenotype (or any distribution that might be defined according to a binary criterion, e.g. disease or case-control situation). Specifically, we can adjust our analysis with those PCs (i.e., the factor scores of individuals), as illustrated in Principal components analysis corrects for stratification in genome-wide association studies, by Price et al. (Nature Genetics 2006, 38(8)), and later work (there was a nice picture showing axes of genetic variation in Europe in Genes mirror geography within Europe; Nature 2008; Fig 1A reproduced below). Note also that another solution is to carry out a stratified analysis (by including ethnicity in a GLM)--this is readily available in the snpMatrix package, for example.
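As a toy illustration of adjusting for PCs, the sketch below compares the estimated effect of a single SNP with and without the top PCs as covariates. It assumes a quantitative phenotype and plain least squares for simplicity (not the exact Eigenstrat procedure, and not a case-control GLM); the simulated data and the helper name snp_effect are hypothetical.

```python
import numpy as np

def snp_effect(y, snp, covariates=None):
    """Least-squares coefficient of the SNP in y ~ 1 + snp (+ covariates)."""
    columns = [np.ones(len(y)), snp]
    if covariates is not None:
        columns.append(covariates)
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

rng = np.random.default_rng(2)
n, n_snps = 1000, 500
freq_a = rng.uniform(0.1, 0.5, n_snps)       # hypothetical frequencies, ancestry A
freq_b = rng.uniform(0.1, 0.5, n_snps)       # hypothetical frequencies, ancestry B
G = np.vstack([rng.binomial(2, freq_a, size=(n, n_snps)),
               rng.binomial(2, freq_b, size=(n, n_snps))]).astype(float)

# Phenotype differs by ancestry only; no SNP has a causal effect
ancestry = np.repeat([0.0, 1.0], n)
y = 1.0 * ancestry + rng.normal(size=2 * n)

# Top two PCs of the normalized genotype matrix
p = G.mean(axis=0) / 2.0
X = (G - 2.0 * p) / np.sqrt(p * (1.0 - p))
U, s, _ = np.linalg.svd(X, full_matrices=False)
pcs = U[:, :2] * s[:2]

# Test the SNP whose frequency differs most between the two ancestries
j = np.argmax(np.abs(freq_a - freq_b))
beta_raw = snp_effect(y, G[:, j])                   # confounded by ancestry
beta_adj = snp_effect(y, G[:, j], covariates=pcs)   # adjusted for the PCs
```

Here beta_raw picks up the ancestry difference and looks like an association, while beta_adj should shrink towards zero once the PCs absorb the stratification.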
References