Inclusion/exclusion of variates (step 3):
I understand that you are asking which of the original measurement channels to include in the modeling.
- Is such a decision sensible for your data?
E.g. I work mainly with spectroscopic data, for which PLS is frequently and successfully used. Well-measured spectra have high correlation between neighbouring variates, and the relevant information in spectroscopic data sets tends to be spread out over many variates. PLS is well suited for such data, but deciding on a variate-by-variate basis which variates to use for the model is IMHO usually not appropriate (deciding on the inclusion/exclusion of spectral ranges based on spectroscopic knowledge about the application is IMHO a far better approach).
- If for your data and application variable selection is a natural choice, is PLS the regularization technique you want?
You may want to read the sections about regularization (3.4 - 3.6) in The Elements of Statistical Learning, where PLS as a regularization is compared to other regularization approaches. My point here is that, in contrast to e.g. the Lasso, PLS is a regularization technique that does not tend to completely exclude variables from the model. I'd thus say that PLS is probably more suitable for data where this behaviour is sensible; but for such data (e.g. spectroscopic data), variable selection is not a natural choice anyway. See the sketch after these points.
- Does your data contain enough information for such a data-driven model optimization? Doing a t-test for each input variable is a massive multiple testing situation.
IMHO the main point of PLS (or other regularization techniques) is to avoid the need for such a variable selection.
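To illustrate the contrast between PLS and the Lasso mentioned above, here is a minimal sketch in R (an illustration on simulated data, assuming the pls and glmnet packages are installed; it is not meant as an analysis of your data):

# Lasso typically sets some coefficients exactly to zero,
# while PLS shrinks coefficients but usually keeps all variables in the model.
library(pls)      # plsr()
library(glmnet)   # cv.glmnet()

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)      # only 2 variables truly matter

pls_fit   <- plsr(y ~ X, ncomp = 3)
lasso_fit <- cv.glmnet(X, y, alpha = 1)

sum(coef(pls_fit, ncomp = 3) == 0)                          # usually 0 variables dropped
sum(as.matrix(coef(lasso_fit, s = "lambda.min"))[-1] == 0)  # usually several dropped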
Remark to Step 2:
If you build a linear regression model in PCA score space, that is usually called principal component regression (PCR) in chemometrics. It is not the same as a PLS model.
How to find out which variates are used by the PCA/PLS model?
There are several ways to approach this question. Obviously, variates where the PCA loadings or PLS weights are 0 do not enter the model. Whether it is sufficient to look at the loadings or whether you need to go a step further depends on your data: if the data set is not standardized you may want to calculate how much each variate "contributes" to the respective PCA/PLS score.
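One possible way to quantify such a "contribution" (my own simple proposal, not necessarily the only sensible measure): for unstandardized data, weight the absolute PCA weights (eigenvector entries) by the standard deviations of the variates, since each score is a weighted sum of the centred variates.

# Sketch: per-component share of each variate in the PCA scores of an
# unstandardized data set (iris used here only as a stand-in example).
p <- prcomp(iris[, -5])
contrib <- abs(p$rotation) * apply(iris[, -5], 2, sd)   # |eigenvector entry| * sd of variate
sweep(contrib, 2, colSums(contrib), `/`)                # columns sum to 1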
Literature where we did that with LDA (it works just the same way): C. Beleites, K. Geiger, M. Kirsch, S. B. Sobottka, G. Schackert and R. Salzer: Raman spectroscopic grading of astrocytoma tissues: using soft reference information, Anal. Bioanal. Chem., 400 (2011), 2801 - 2816. The linked page provides links to both the official web page and my manuscript.
You can also derive e.g. bootstrap distributions of the loadings (or the contributions) and have a look at them. For PCR and PLS coefficients that is straightforward, as the Y variable automatically "aligns" the coefficients. PCA and PLS scores need some more care, as e.g. flipping of the directions needs to be taken into account, and you may decide to treat models as equivalent where the scores which are then used for further modeling are just rotated or scaled versions of each other. Thus, you may want to align the scores first e.g. by Procrustes analysis. The paper linked above also discusses this (for LDA, but again, the ideas apply to the other bilinear models as well).
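A minimal sketch of such a bootstrap (my own illustration, not the exact procedure of the paper; only the sign flipping is handled here, not a full Procrustes alignment):

# Bootstrap the PCA loadings and align the sign of each component with the
# full-data solution before summarizing, so arbitrary flips do not inflate the spread.
set.seed(1)
X   <- as.matrix(iris[, -5])
ref <- prcomp(X)$rotation[, 1:2]                  # reference directions

boot_loadings <- replicate(200, {
  L <- prcomp(X[sample(nrow(X), replace = TRUE), ])$rotation[, 1:2]
  sweep(L, 2, sign(colSums(L * ref)), `*`)        # flip each component towards the reference
})

# e.g. bootstrap percentile intervals of the PC1 loading of each variate
apply(boot_loadings[, 1, ], 1, quantile, probs = c(0.025, 0.975))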
Last but not least, you need to be careful not to overinterpret the models: if there is correlation between variates, you can have situations where important variates have coefficients that frequently touch the zero mark in the bootstrap experiments. What you can or cannot conclude will, however, depend on your type of data.
Principal components analysis and Linear discriminant analysis outputs; iris data.
I will not be drawing biplots because biplots can be drawn with various normalizations and therefore may look different. Since I'm not an R
user, I have difficulty tracking down how you produced your plots in order to repeat them. Instead, I will do PCA and LDA and show the results, in a manner similar to this (you might want to read it). Both analyses were done in SPSS.
Principal components of iris data:
The analysis will be based on covariances (not correlations) between the 4 variables.
Eigenvalues (component variances) and the proportion of overall variance explained
PC1 4.228241706 .924618723
PC2 .242670748 .053066483
PC3 .078209500 .017102610
PC4 .023835093 .005212184
# @Etienne's comment:
# Eigenvalues are obtained in R by
# (princomp(iris[,-5])$sdev)^2 or (prcomp(iris[,-5])$sdev)^2.
# (Note: princomp divides by n while prcomp divides by n-1, so the two differ
# slightly; the eigenvalues shown above match prcomp.)
# Proportion of variance explained is obtained in R by
# summary(princomp(iris[,-5])) or summary(prcomp(iris[,-5]))
Eigenvectors (cosines of rotation of variables into components)
PC1 PC2 PC3 PC4
SLength .3613865918 .6565887713 -.5820298513 .3154871929
SWidth -.0845225141 .7301614348 .5979108301 -.3197231037
PLength .8566706060 -.1733726628 .0762360758 -.4798389870
PWidth .3582891972 -.0754810199 .5458314320 .7536574253
# @Etienne's comment:
# This is obtained in R by
# prcomp(iris[,-5])$rotation or princomp(iris[,-5])$loadings
Loadings (eigenvectors normalized to respective eigenvalues;
loadings are the covariances between variables and standardized components)
PC1 PC2 PC3 PC4
SLength .743108002 .323446284 -.162770244 .048706863
SWidth -.173801015 .359689372 .167211512 -.049360829
PLength 1.761545107 -.085406187 .021320152 -.074080509
PWidth .736738926 -.037183175 .152647008 .116354292
# @Etienne's comment:
# Loadings can be obtained in R with
# t(t(princomp(iris[,-5])$loadings) * princomp(iris[,-5])$sdev) or
# t(t(prcomp(iris[,-5])$rotation) * prcomp(iris[,-5])$sdev)
Standardized (rescaled) loadings
(loadings divided by st. deviations of the respective variables)
PC1 PC2 PC3 PC4
SLength .897401762 .390604412 -.196566721 .058820016
SWidth -.398748472 .825228709 .383630296 -.113247642
PLength .997873942 -.048380599 .012077365 -.041964868
PWidth .966547516 -.048781602 .200261695 .152648309
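In case it helps, a small R sketch of how these rescaled loadings can be reproduced (a sketch of mine, not @Etienne's; signs may be flipped relative to the SPSS output):

# loadings = eigenvectors scaled by the component standard deviations;
# rescaled loadings = loadings divided by the standard deviations of the variables
p        <- prcomp(iris[, -5])
loadings <- p$rotation %*% diag(p$sdev)
sweep(loadings, 1, apply(iris[, -5], 2, sd), `/`)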
Raw component scores (Centered 4-variable data multiplied by eigenvectors)
PC1 PC2 PC3 PC4
-2.684125626 .319397247 -.027914828 .002262437
-2.714141687 -.177001225 -.210464272 .099026550
-2.888990569 -.144949426 .017900256 .019968390
-2.745342856 -.318298979 .031559374 -.075575817
-2.728716537 .326754513 .090079241 -.061258593
-2.280859633 .741330449 .168677658 -.024200858
-2.820537751 -.089461385 .257892158 -.048143106
-2.626144973 .163384960 -.021879318 -.045297871
-2.886382732 -.578311754 .020759570 -.026744736
-2.672755798 -.113774246 -.197632725 -.056295401
... etc.
# @Etienne's comment:
# This is obtained in R with
# prcomp(iris[,-5])$x or princomp(iris[,-5])$scores.
# Can also be eigenvector normalized for plotting
Standardized (to unit variance) component scores; when multiplied
by the loadings they return the original centered variables.
It is important to stress that it is loadings, not eigenvectors, by which we typically interpret principal components (or factors in factor analysis) - if we need to interpret them at all. Loadings are the regression coefficients for modeling the variables by the standardized components. At the same time, because the components don't intercorrelate, the loadings are also the covariances between those components and the variables. Standardized (rescaled) loadings, like correlations, cannot exceed 1 and are handier to interpret because the effect of the unequal variances of the variables is removed.
It is loadings, not eigenvectors, that are typically displayed on a biplot side-by-side with component scores; the latter are often displayed column-normalized.
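A small R check of the two statements above (my sketch; it assumes the PCA is done on covariances, as here, and uses prcomp, so signs may be flipped relative to the SPSS output):

# 1) loadings equal the covariances between the variables and the standardized scores;
# 2) standardized scores times the transposed loadings reproduce the centered data.
p          <- prcomp(iris[, -5])
loadings   <- p$rotation %*% diag(p$sdev)
std_scores <- scale(p$x)                                  # unit-variance component scores

max(abs(cov(as.matrix(iris[, -5]), std_scores) - loadings))                        # ~ 0
max(abs(std_scores %*% t(loadings) - scale(as.matrix(iris[, -5]), scale = FALSE))) # ~ 0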
Linear discriminants of iris data:
There are 3 classes and 4 variables: min(3-1, 4) = 2 discriminants can be extracted.
Only the extraction (no classification of data points) will be done.
The Within scatter matrix
38.95620000 13.63000000 24.62460000 5.64500000
13.63000000 16.96200000 8.12080000 4.80840000
24.62460000 8.12080000 27.22260000 6.27180000
5.64500000 4.80840000 6.27180000 6.15660000
The Between scatter matrix
63.2121333 -19.9526667 165.2484000 71.2793333
-19.9526667 11.3449333 -57.2396000 -22.9326667
165.2484000 -57.2396000 437.1028000 186.7740000
71.2793333 -22.9326667 186.7740000 80.4133333
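For reference, here is one way these scatter (SSCP) matrices can be reproduced in R (my sketch, not @Etienne's):

# W: pooled within-group sums of squares and cross-products (deviations from group means);
# B: between-group SSCP, obtained as total SSCP minus W.
X <- as.matrix(iris[, -5]); g <- iris$Species
total_sscp <- crossprod(scale(X, scale = FALSE))            # total SSCP (T)
W <- Reduce(`+`, lapply(levels(g), function(k)
       crossprod(scale(X[g == k, ], scale = FALSE))))       # pooled within-group SSCP
B <- total_sscp - W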
Eigenvalues and canonical correlations
(Canonical correlation squared is SSbetween/SStotal of ANOVA by that discriminant)
Dis1 32.19192920 .98482089
Dis2 .28539104 .47119702
# @Etienne's comment:
# In R eigenvalues are expected from
# lda(as.factor(Species)~.,data=iris)$svd, but this produces
# Dis1 Dis2
# 48.642644 4.579983
# @ttnphns' comment:
# The difference might be due to a different computational approach
# (e.g. I used eigendecomposition and R used svd?) and is of no importance.
# Canonical correlations though should be the same.
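If you want to reproduce the eigenvalues and canonical correlations above directly, here is a sketch of mine (using an eigendecomposition of the same W and B as in the sketch above, rather than lda()'s internal svd):

# Eigenvalues of solve(W) %*% B give 32.19 and 0.285;
# canonical correlations follow as sqrt(lambda / (1 + lambda)) = 0.985 and 0.471.
X <- as.matrix(iris[, -5]); g <- iris$Species
W <- Reduce(`+`, lapply(levels(g), function(k)
       crossprod(scale(X[g == k, ], scale = FALSE))))
B <- crossprod(scale(X, scale = FALSE)) - W
lambda <- Re(eigen(solve(W) %*% B)$values)[1:2]
sqrt(lambda / (1 + lambda))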
Eigenvectors
Dis1 Dis2
SLength -.0684059150 .0019879117
SWidth -.1265612055 .1785267025
PLength .1815528774 -.0768635659
PWidth .2318028594 .2341722673
Eigenvectors (as before, but column-normalized to SS=1: cosines of rotation of variables into discriminants).
Dis1 Dis2
SLength -.2087418215 .0065319640
SWidth -.3862036868 .5866105531
PLength .5540117156 -.2525615400
PWidth .7073503964 .7694530921
Unstandardized discriminant coefficients (proportionally related to eigenvectors)
Dis1 Dis2
SLength -.829377642 .024102149
SWidth -1.534473068 2.164521235
PLength 2.201211656 -.931921210
PWidth 2.810460309 2.839187853
# @Etienne's comment:
# This is obtained in R with
# lda(as.factor(Species)~.,data=iris)$scaling
# which is described as being standardized discriminant coefficients in the function definition.
Standardized discriminant coefficients
Dis1 Dis2
SLength -.4269548486 .0124075316
SWidth -.5212416758 .7352613085
PLength .9472572487 -.4010378190
PWidth .5751607719 .5810398645
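As far as I can tell from the numbers, the relation between the two sets of coefficients is simply this (a sketch of mine, not part of the original output): the standardized coefficients equal the unstandardized ones multiplied by the pooled within-group standard deviation of each variable.

# pooled within-group sd = sqrt(diagonal of W / (n - number of groups))
X <- as.matrix(iris[, -5]); g <- iris$Species
W <- Reduce(`+`, lapply(levels(g), function(k)
       crossprod(scale(X[g == k, ], scale = FALSE))))
pooled_sd  <- sqrt(diag(W) / (nrow(X) - nlevels(g)))
unstd_dis1 <- c(-0.829377642, -1.534473068, 2.201211656, 2.810460309)  # Dis1 column above
unstd_dis1 * pooled_sd                                                 # ~ standardized Dis1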
Pooled within-groups correlations between variables and discriminants
Dis1 Dis2
SLength .2225959415 .3108117231
SWidth -.1190115149 .8636809224
PLength .7060653811 .1677013843
PWidth .6331779262 .7372420588
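These correlations can also be reproduced with MASS::lda (my sketch; the signs of the discriminants may be flipped relative to the table, since correlations are invariant to the scaling of the scores but not to their sign):

# Pooled within-group correlation = pooled within-group cross-products divided by
# the square roots of the pooled within-group sums of squares.
library(MASS)
X      <- as.matrix(iris[, -5]); g <- iris$Species
scores <- predict(lda(Species ~ ., data = iris), iris[, -5])$x

center_by_group <- function(M) do.call(rbind, lapply(levels(g), function(k)
  scale(M[g == k, , drop = FALSE], scale = FALSE)))         # deviations from group means
Xw <- center_by_group(X); Sw <- center_by_group(scores)
crossprod(Xw, Sw) / outer(sqrt(colSums(Xw^2)), sqrt(colSums(Sw^2)))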
Discriminant scores (Centered 4-variable data multiplied by unstandardized coefficients)
Dis1 Dis2
-8.061799783 .300420621
-7.128687721 -.786660426
-7.489827971 -.265384488
-6.813200569 -.670631068
-8.132309326 .514462530
-7.701946744 1.461720967
-7.212617624 .355836209
-7.605293546 -.011633838
-6.560551593 -1.015163624
-7.343059893 -.947319209
... etc.
# @Etienne's comment:
# This is obtained in R with
# predict(lda(as.factor(Species)~.,data=iris), iris[,-5])$x
For the computations involved in extracting the discriminants in LDA, please look here. We usually interpret discriminants by the discriminant coefficients or the standardized discriminant coefficients (the latter are handier because the effect of the unequal variances of the variables is removed). This is like in PCA. But note: the coefficients here are the regression coefficients for modeling the discriminants by the variables, not vice versa as it was in PCA. Because the variables are not uncorrelated, the coefficients cannot be seen as covariances between the variables and the discriminants.
Yet we have another matrix that may serve as an alternative source for interpreting the discriminants - the pooled within-group correlations between the discriminants and the variables. Because the discriminants are uncorrelated, like PCs, this matrix is in a sense analogous to the standardized loadings of PCA.
In all, while in PCA we have only one matrix - the loadings - to help interpret the latents, in LDA we have two alternative matrices for that. If you need to plot (a biplot or whatever), you have to decide whether to plot coefficients or correlations.
And, of course, it is needless to point out that in a PCA of the iris data the components don't "know" that there are 3 classes; they can't be expected to discriminate between the classes. The discriminants do "know" there are classes, and discriminating between them is their natural job.
Best Answer
OK, since nobody answered, I think that, after some experimentation, I can do it myself. Following discriminant analysis guidelines, let T be the SSCP matrix of the whole cloud (the data X of 2 variables, as deviations from the cloud's centre), and let W be the pooled within-cluster SSCP matrix (deviations from the respective cluster centres). B = T - W is then the between-cluster SSCP matrix. Singular value decomposition of inv(W)B yields U (left eigenvectors), S (the diagonal matrix of eigenvalues) and V (right eigenvectors). In my example of 2 clusters only the 1st eigenvalue is nonzero (which means that there is only one discriminant), so we use only the 1st eigenvector (column) of U: U(1). Now, XU(1) gives the sought-for raw discriminant scores. To show the discriminant as a line tiled with those scores, multiply the scores by the cosine between each axis and the discriminant (which is the corresponding element of the eigenvector U(1)) - just as I did with the principal component above. The resulting plot is below.
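A minimal R sketch of that procedure on simulated toy data (the two clusters and all names here are placeholders, not the asker's data):

# T: total SSCP; W: pooled within-cluster SSCP; B = T - W.
# SVD of solve(W) %*% B gives the discriminant direction u1 (only one nonzero value
# for 2 clusters); Xc %*% u1 are the raw discriminant scores, and multiplying the
# scores by the elements of u1 places the discriminant "line" back in variable space.
set.seed(1)
X  <- rbind(matrix(rnorm(40), ncol = 2), matrix(rnorm(40, mean = 3), ncol = 2))
cl <- rep(1:2, each = 20)

Xc <- scale(X, scale = FALSE)                                # deviations from the cloud's centre
total_sscp <- crossprod(Xc)
W  <- Reduce(`+`, lapply(1:2, function(k)
        crossprod(scale(X[cl == k, ], scale = FALSE))))
B  <- total_sscp - W

u1      <- svd(solve(W) %*% B)$u[, 1]
scores  <- Xc %*% u1                                         # raw discriminant scores
line_xy <- scores %*% t(u1)                                  # scores projected back onto the axes
plot(Xc, col = cl, asp = 1); points(line_xy, pch = 3)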