Correspondence Analysis – Interpreting 2D Correspondence Analysis Plots (Part II)

biplot, correspondence-analysis, r

I'd like to ensure that I understand the process correctly. This is a follow-up question to "Interpreting 2D correspondence analysis plots".

library(reshape) 
library(ca)

df <- read.csv(file="http://www.bertelsen.ca/R/smokers.csv")
colnames(df)[7] <- "value"  ## make reshape smart
df <- cast(df, SMOKER ~ GEO) ## reshape data
row.names(df) <- df$SMOKER ## rename rows
df <- df[2:ncol(df)] ## reset df
df <- df[-4,] ## Let's only look at people who have smoked
df <- df[c("AB","BC","ON","QC")] ## and only the biggest 4 provinces (KISS)
plot(ca(df))

summary(ca(df))

Output

Principal inertias (eigenvalues):

 dim    value      %   cum%   scree plot               
 1      0.002523  99.9  99.9  *************************
 2      3e-06000   0.1 100.0                           
 3      00000000   0.0 100.0                           
        -------- -----                                 
 Total: 0.002526 100.0                                 


Rows:
    name   mass  qlt  inr    k=1  cor ctr    k=2 cor ctr  
1 | Crrn |  265 1000  191 |  -43 1000 191 |    1   0  43 |
2 | Dlys |  201 1000  351 |  -66 1000 351 |   -1   0  70 |
3 | Frmr |  470 1000  432 |   48 1000 432 |   -1   0  98 |
4 | Occs |   65 1000   26 |   31  964  25 |    6  36 789 |

Columns:
    name   mass  qlt  inr    k=1  cor ctr    k=2 cor ctr  
1 |   AB |  116 1000  146 |  -56 1000 146 |   -1   0  34 |
2 |   BC |  142 1000  775 |  118 1000 776 |   -1   0  41 |
3 |   ON |  434 1000    7 |   -6  909   6 |    2  91 540 |
4 |   QC |  308 1000   72 |  -24  994  72 |   -2   6 385 |

Looking at summary(ca(df)) I see that nearly 100% of the inertia is described by the row profile for both modalities (Type of smoker and Province, respectively).

Figure: CA of Smoker Types in ON, QC, AB, and BC

What (I think) should be immediate takeaways are:

You are more likely to be a daily smoker if you live in AB, QC, or ON
You are more likely to be a former smoker if you live in BC
You are least likely to be a daily smoker if you live in BC (this fits with the Canada-wide understanding of BC's "active lifestyle" culture)

What could we say about occasional smokers? What would your analysis tell us through this correspondence plot and its associated summary?

Data Source: Statistics Canada, Canadian Community Health Survey (CCHS 3.1), 2005. The CANSIM table 105-0427 was an update of CANSIM table 105-0227. More current data are in CANSIM tables 105-0501 and 105-0502.

Best Answer

I'm an ecologist, so I apologise in advance if this sounds a bit strange :-)

I like to think of these plots in terms of weighted averages. The region points are at the weighted averages of the smoking status classes and vice versa.
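The weighted-average reading can be checked numerically. The following is only a sketch in base R, and it uses the built-in HairEyeColor table as a stand-in because the smoking data are not reproduced here. The names r_mass, c_mass, col_std and row_prin are my own. It shows that each row point, in principal coordinates, sits at the average of the column points in standard coordinates, weighted by that row's profile. In a symmetric map both sets are drawn in principal coordinates, so the same relation holds only up to a rescaling of each axis.

```r
# Base R only; HairEyeColor ships with R's datasets package.
tab <- apply(HairEyeColor, c(1, 2), sum)   # Hair x Eye counts

P      <- tab / sum(tab)                   # correspondence matrix
r_mass <- rowSums(P)                       # row masses
c_mass <- colSums(P)                       # column masses

# Standardised residuals from independence, and their SVD
S   <- (P - outer(r_mass, c_mass)) / sqrt(outer(r_mass, c_mass))
dec <- svd(S)
k   <- min(dim(tab)) - 1                   # number of non-trivial dimensions

# Column standard coordinates and row principal coordinates
col_std  <- sweep(dec$v, 1, sqrt(c_mass), "/")[, 1:k]
row_prin <- (sweep(dec$u, 1, sqrt(r_mass), "/") %*% diag(dec$d))[, 1:k]

# Row profiles (each row sums to 1) are the averaging weights
row_profiles <- prop.table(tab, 1)
row_as_wavg  <- row_profiles %*% col_std

all.equal(unname(row_prin), unname(row_as_wavg))
```

Swapping the roles of rows and columns gives the "vice versa" part of the statement.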

The problem with the above figure is the axis scaling and the fact that you can't display all the relationships (chi-square distance between regions and chi-square distance between smoking status) on the one figure. By the looks of it, the figure is using what is known as symmetric scaling, which has been shown to be a good compromise, preserving as much of the information in the sets of scores as possible.

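I'm not familiar with the ca package but I am with the vegan package and its cca function:

library(vegan)
df <- data.frame(df)    ## convert to a plain data frame
ord <- cca(df)          ## cca() without constraints fits a correspondence analysis
plot(ord, scaling = 3)  ## scaling = 3 is vegan's symmetric scaling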

The last plot is a bit easier to read than the one you show but AFAICT they are the same (or at least similarly scaled).

So I would say that occasional smokers are lower in number than expected in QC, BC and AB, and most associated with ON, but that in all regions, occasional smokers are low in number - they differ markedly from the expected number.

However, there is a single dominant "gradient" or axis of variation in these data and as the second axis represents so little variation, I would likely not interpret this component at all.

Related Solutions

Correspondence Analysis – Interpreting 2D Correspondence Analysis Plots

First, there are different ways to construct so-called biplots in the case of correspondence analysis. In all cases, the basic idea is to find a way to show the best 2D approximation of the "distances" between row cells and column cells. In other words, we seek a hierarchy (we also speak of "ordination") of the relationships between rows and columns of a contingency table.

Very briefly, CA decomposes the chi-square statistic associated with the two-way table into orthogonal factors that maximize the separation between row and column scores (i.e. the frequencies computed from the table of profiles). Here, you see that there is some connection with PCA, but the measure of variance (or the metric) retained in CA is the $\chi^2$, which only depends on column profiles. (As it tends to give more importance to modalities that have large marginal values, we can also re-weight the initial data, but this is another story.)

Here is a more detailed answer. The implementation that is proposed in the corresp() function (in MASS) follows from a view of CA as an SVD of dummy coded matrices representing the rows and columns (such that $R^tC=N$, with $N$ the contingency table). This is in line with canonical correlation analysis. In contrast, the French school of data analysis considers CA as a variant of PCA, where you seek the directions that maximize the "inertia" in the data cloud. This is done by diagonalizing the inertia matrix computed from the centered and scaled (by marginal frequencies) two-way table, and expressing row and column profiles in this new coordinate system.
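To make the dummy-coding view concrete, here is a small base R sketch. It assumes the built-in HairEyeColor data aggregated over sex, and the object names (cells, people, R, C, N) are illustrative only. The counts are expanded back to one record per person, each variable is coded as an indicator matrix, and $R^tC$ then reproduces the two-way table of counts.

```r
tab <- apply(HairEyeColor, c(1, 2), sum)          # Hair x Eye counts

# One record per person: expand the cell counts back to individuals
cells  <- as.data.frame(as.table(tab))            # columns Hair, Eye, Freq
people <- cells[rep(seq_len(nrow(cells)), cells$Freq), c("Hair", "Eye")]

# Dummy (indicator) coding: one 0/1 column per category
R <- model.matrix(~ Hair - 1, data = people)
C <- model.matrix(~ Eye - 1, data = people)

N <- crossprod(R, C)                              # t(R) %*% C
N
```

Each person contributes a single 1 to exactly one cell of the product, which is why the cross-product of the two indicator matrices gives back the contingency table.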

If you consider a table with $i=1,\dots,I$ rows, and $j=1,\dots,J$ columns, each row is weighted by its corresponding marginal sum, which yields a series of conditional frequencies associated with each row: $f_{j|i}=n_{ij}/n_{i\cdot}$. The marginal column is called the mean profile (for rows). This gives us a vector of coordinates, also called a profile (by row). For the columns, we have $f_{i|j}=n_{ij}/n_{\cdot j}$. In both cases, we will consider the $I$ row profiles (associated with their weights $f_{i\cdot}$) as individuals in the column space, and the $J$ column profiles (associated with their weights $f_{\cdot j}$) as individuals in the row space. The metric used to compute the proximity between any two individuals is the $\chi^2$ distance. For instance, between two rows $i$ and $i'$, we have

$$ d^2_{\chi^2}(i,i')=\sum_{j=1}^J\frac{n}{n_{\cdot j}}\left(\frac{n_{ij}}{n_{i\cdot}}-\frac{n_{i'j}}{n_{i'\cdot}} \right)^2 $$
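The formula translates almost directly into base R. This is a sketch only: chisq_dist_rows is a hypothetical helper name, and the HairEyeColor table is used as example data. As a cross-check, the same distance is recovered as the ordinary Euclidean distance between the two rows once they are expressed in all principal coordinates.

```r
tab <- apply(HairEyeColor, c(1, 2), sum)          # Hair x Eye counts

# Chi-square distance between rows i and k of a two-way table of counts
chisq_dist_rows <- function(tab, i, k) {
  n       <- sum(tab)
  col_tot <- colSums(tab)                         # n_{.j}
  prof_i  <- tab[i, ] / sum(tab[i, ])             # n_{ij} / n_{i.}
  prof_k  <- tab[k, ] / sum(tab[k, ])
  sqrt(sum(n / col_tot * (prof_i - prof_k)^2))
}

d_chisq <- chisq_dist_rows(tab, "Black", "Blond")

# Cross-check: Euclidean distance in the full set of principal coordinates
P        <- tab / sum(tab)
r_mass   <- rowSums(P)
c_mass   <- colSums(P)
S        <- (P - outer(r_mass, c_mass)) / sqrt(outer(r_mass, c_mass))
dec      <- svd(S)
row_prin <- sweep(dec$u, 1, sqrt(r_mass), "/") %*% diag(dec$d)
rownames(row_prin) <- rownames(tab)
d_euclid <- sqrt(sum((row_prin["Black", ] - row_prin["Blond", ])^2))

all.equal(d_chisq, d_euclid)
```

The weight $n/n_{\cdot j}$ is what keeps a difference in a rare column category from being swamped by the common ones.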

You may also see the link with the $\chi^2$ statistic by noting that it is simply the distance between observed and expected counts, where expected counts (under $H_0$, independence of the two variables) are computed as $n_{i\cdot}\times n_{\cdot j}/n$ for each cell $(i,j)$. If the two variables were to be independent, the row profiles would be all equal, and identical to the corresponding marginal profile. In other words, when there is independence, your contingency table is entirely determined by its margins.
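As an illustration (a base R sketch on the HairEyeColor table, with variable names of my own choosing), the expected counts can be built from the margins alone and compared with what chisq.test() reports. The last lines check that every row profile of the expected table equals the marginal profile.

```r
tab <- apply(HairEyeColor, c(1, 2), sum)          # Hair x Eye counts
n   <- sum(tab)

# Expected counts under independence: n_{i.} * n_{.j} / n
expected <- outer(rowSums(tab), colSums(tab)) / n

# Pearson chi-square: squared gaps between observed and expected, scaled by expected
chi2 <- sum((tab - expected)^2 / expected)

test <- chisq.test(tab)
all.equal(expected, test$expected, check.attributes = FALSE)
all.equal(chi2, unname(test$statistic))

# Under independence every row profile equals the marginal (mean) profile
expected_profiles <- prop.table(expected, 1)
mean_profile      <- colSums(tab) / n
expected_profiles
mean_profile
```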

If you perform a PCA on the row profiles (viewed as individuals), replacing the Euclidean distance by the $\chi^2$ distance, then you get your CA. The first principal axis is the line that is the closest to all points, and the corresponding eigenvalue is the inertia explained by this dimension. You can do the same with the column profiles. It can be shown that there is a symmetry between the two approaches, and more specifically that the principal components (PC) for the column profiles are associated with the same eigenvalues as the PCs for the row profiles. What is shown on a biplot is the coordinates of the individuals in this new coordinate system, although the individuals are represented in a separate factorial space. Provided each individual/modality is well represented in its factorial space (you can look at the $\cos^2$ of the modality with the 1st principal axis, which is a measure of the correlation/association), you can even interpret the proximity between elements $i$ and $j$ of your contingency table (as can be done by looking at the residuals of your $\chi^2$ test of independence, e.g. chisq.test(tab)$expected-chisq.test(tab)$observed).
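Both claims can be checked with a few lines of base R. This is a sketch on the HairEyeColor table, and the names S, eig_rows, eig_cols and cos2 are mine. The row analysis and the column analysis amount to diagonalizing $S^tS$ and $SS^t$, where $S$ holds the standardized residuals, so they share their eigenvalues. The $\cos^2$ of a row on an axis is the share of that row's squared distance to the centroid that the axis accounts for.

```r
tab    <- apply(HairEyeColor, c(1, 2), sum)       # Hair x Eye counts
P      <- tab / sum(tab)
r_mass <- rowSums(P)
c_mass <- colSums(P)

# Standardised residuals from independence
S <- (P - outer(r_mass, c_mass)) / sqrt(outer(r_mass, c_mass))

# Row profiles live in column space (S'S); column profiles in row space (SS')
eig_rows <- eigen(crossprod(S),  symmetric = TRUE)$values
eig_cols <- eigen(tcrossprod(S), symmetric = TRUE)$values
all.equal(eig_rows, eig_cols)

# Row principal coordinates, then cos2 = squared coordinate / squared distance
dec      <- svd(S)
row_prin <- sweep(dec$u, 1, sqrt(r_mass), "/") %*% diag(dec$d)
cos2     <- row_prin^2 / rowSums(row_prin^2)
rownames(cos2) <- rownames(tab)

round(cos2[, 1:2], 3)   # quality of each hair colour on axes 1 and 2
```

A row with a small $\cos^2$ on the displayed axes is poorly represented there, so its apparent proximity to other points should not be over-interpreted.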

The total inertia of your CA (= the sum of eigenvalues) is the $\chi^2$ statistic divided by $n$ (which is Pearson's $\phi^2$).
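This identity is easy to verify. The following base R sketch uses the HairEyeColor table, with names such as total_inertia and phi2 chosen for illustration. It takes the eigenvalues as squared singular values of the standardized residual matrix and compares their sum with the $\chi^2$ statistic divided by $n$.

```r
tab    <- apply(HairEyeColor, c(1, 2), sum)       # Hair x Eye counts
n      <- sum(tab)
P      <- tab / n
r_mass <- rowSums(P)
c_mass <- colSums(P)

# Standardised residuals; their squared singular values are the principal inertias
S   <- (P - outer(r_mass, c_mass)) / sqrt(outer(r_mass, c_mass))
eig <- svd(S)$d^2

total_inertia <- sum(eig)
phi2          <- unname(chisq.test(tab)$statistic) / n
all.equal(total_inertia, phi2)

# Share of inertia per dimension, as in the "%" column of summary(ca(tab))
round(100 * eig / total_inertia, 1)
```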

Actually, there are several packages that may provide you with enhanced CAs compared to the function available in the MASS package: ade4, FactoMineR, anacor, and ca.

The last is the one that was used for your particular illustration, and a paper was published in the Journal of Statistical Software that explains most of its functionalities: Correspondence Analysis in R, with Two- and Three-dimensional Graphics: The ca Package.

So, your example on eye/hair colors can be reproduced in many ways:

data(HairEyeColor)
tab <- apply(HairEyeColor, c(1, 2), sum) # aggregate on gender
tab

library(MASS)
plot(corresp(tab, nf=2))
corresp(tab, nf=2)

library(ca)
plot(ca(tab))
summary(ca(tab, nd=2))

library(FactoMineR)
CA(tab)
CA(tab, graph=FALSE)$eig  # == summary(ca(tab))$scree[,"values"]
CA(tab, graph=FALSE)$row$contrib

library(ade4)
scatter(dudi.coa(tab, scannf=FALSE, nf=2))

In all cases, what we read in the resulting biplot is basically (I limit my interpretation to the 1st axis which explained most of the inertia):

the first axis highlights the clear opposition between light and dark hair color, and between blue and brown eyes;
people with blond hair tend to also have blue eyes, and people with black hair tend to have brown eyes.

There are a lot of additional resources on data analysis from the bioinformatics lab in Lyon, France. They are mostly in French, but I think that would not be too much of a problem for you. The following two handouts should be interesting as a first start:

Finally, when you consider a full disjunctive (dummy) coding of $k$ variables, you get the multiple correspondence analysis.

Solved – Interpreting multiple correspondence analysis

The standard visualisation is the biplot. The interpretation depends on the details of the technique applied but will usually lean on some notion of inner product. Since I don't know what SPSS does when you ask for MCA, I hesitate to offer more concrete advice. Nevertheless, you'll surely find all you need to interpret them in the (free) book Biplots in Practice, specifically chapters 9-10.

However, if you're wondering how to interpret its output then you might profitably first revise your theory of correspondence analysis. Greenacre's CA in Practice is a good applied text. Ch. 9 covers biplots and ch. 16-20 revise the multi-way extensions of simple correspondence analysis (they are short chapters). That should provide enough background to see what SPSS is offering you.

As @ttnphns points out, a two way table implies simple rather than multiple correspondence analysis. Then things are indeed easier (but still see references above).
