For very extreme high dimensions in PCA, the number of dimensions $p$ is larger than the sample size $N$, does PCA work well or does it work at all? By 'work' I mean

does it work mathematically

If it can work mathematically, does it make any sense to use it for such high dimension.
If PCA works, how does it work in discrimination analysis? "PCA in discrimination analysis" means it tries to keep the most discrimination power rather than keep most variations as more variation does not necessarily mean the separation of different group of data is maximum. So Chang (1983) in his paper "On using principal components before separating a mixture of two multivariate normal distributions" suggest using Mahalanobis distance in some way to measure it. His paper can be found here: http://www.jstor.org/discover/10.2307/2347949?uid=3738032&uid=2129&uid=2&uid=70&uid=4&sid=21100901419681
Best Answer
I think you are very confused about what principal component analysis is and the Chang paper has added to your confusion. First we have multivariate data in say k dimensions the principal components are a particular transformation of the coordinates such that the first principal component exhibits the largest variation in the data for any one component. The second principal component is the component orthogonal to the first that exhibits the largest remaining variation in the data. The third is orthogonal to the first two and exhibits the most remaining variation (not exhibited by PC 1 or PC 2). The remaining principal components are defined analogously. A new k dimensional set of orthogonal basis vectors is thus constructed in this way.
The purpose is dimensionality reduction. What is commonly called principal component analysis is the determination of the principal components for the multivariate data set and then projecting the data into the first few principal components to get a good lower dimensional visualization of the shape of the data. The number of principal components taken is based on the total variance in the data explained by those components.
Now people think to apply PCA in clustering because they think the clusters can be visualized by looking in 2  dimension planes determined by these components. In the paper by Chang he suggests that the better thing to do in cluster analysis is to look at the last principal components which exhibit the least amount of variation. I guess in the example he looks at it is easier to see spatial separation because the in each cluster is more tightly packed. I am not sure that this can be made a general principle. Chang demonstrates his claim on a special example. The data are generated from a mixture of 2 k dimensional normal distributions with different means but identical covariance matrices. This makes the data appear to have 2 clusters. Apparently if you simulate data from these mixture distributions the cluster separation is best seen in the last few PCs.
Hopefully I have cleared things up about PCA regarding this question and your other one. Now regarding your question about the large p small N problem principal component analysis will still have all the same mathematical properties as for any other values for p and N. Regarding cluster analysis when pis large and N is small, any method of clustering will have difficulties in such as setting. Clustering will be difficult with PCA whether you apply it the conventional way or the way Chen suggests. But I see nothing special about PCA that would make it have more difficulties compared to other methods.