Solved – Are there cases where PCA is more suitable than t-SNE?


I want to see how 7 measures of text-correction behaviour (time spent correcting the text, number of keystrokes, etc.) relate to each other. The measures are correlated. I ran a PCA to see how the measures projected onto PC1 and PC2, which avoided running many overlapping pairwise correlation tests between the measures.

I was asked why I did not use t-SNE instead, since the relationship between some of the measures might be non-linear.

I can see how allowing for non-linearity could help, but is there any good reason to use PCA in this case rather than t-SNE? I'm not interested in clustering the texts according to their relationship to the measures, but rather in the relationships between the measures themselves.
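For context, the setup is roughly the sketch below (illustrative only: the measure names are made up and synthetic data stands in for the real texts): standardise the 7 measures, fit a 2-component PCA, and read the loadings to see how each measure relates to PC1 and PC2.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 200 texts, 7 correlated correction
# measures with hypothetical names (replace with the actual DataFrame).
latent = rng.normal(size=(200, 2))
raw = latent @ rng.normal(size=(2, 7)) + 0.3 * rng.normal(size=(200, 7))
measures = ["time_correcting", "keystrokes", "insertions", "deletions",
            "pauses", "backspaces", "cursor_moves"]
df = pd.DataFrame(raw, columns=measures)

X = StandardScaler().fit_transform(df)        # standardise before PCA
pca = PCA(n_components=2).fit(X)

# Loadings: how each measure projects onto PC1 and PC2.
loadings = pd.DataFrame(pca.components_.T, index=measures, columns=["PC1", "PC2"])
print(loadings)
print(pca.explained_variance_ratio_)          # variance explained by each PC
```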

(I guess EFA could also be a better/another approach, but that's a different discussion.)
Compared to other methods, there are few posts on here about t-SNE, so the question seems worth asking.

Best Answer

$t$-SNE is a great piece of machine learning, but one can find many reasons to use PCA instead of it. Like most other computational methodologies in use, $t$-SNE is no silver bullet, and there are quite a few reasons that make it a suboptimal choice in some cases. Off the top of my head, let me mention five points in brief:

  1. Stochasticity of the final solution. PCA is deterministic; $t$-SNE is not. One gets a nice visualisation, then her colleague gets another visualisation, and then the debate starts about which looks better and whether a difference of $0.03\%$ in the $KL(P||Q)$ divergence is meaningful... In PCA the correct answer to the question posed is guaranteed. $t$-SNE's cost function has multiple local minima that can lead to different solutions. This necessitates multiple runs and raises questions about the reproducibility of the results. (The sketch after this list illustrates points 1 to 3.)

  2. Interpretability of the mapping. This relates to the point above, but let's assume that a team has agreed on a particular random seed/run. Now the question becomes what the embedding shows... $t$-SNE tries to map only local neighbourhoods correctly, so any insights drawn from that embedding should be treated very cautiously; global trends are not accurately represented (which can actually be a great thing for visualisation). On the other hand, PCA is just a rotation that diagonalises our initial covariance matrix, and the eigenvectors represent a new axial system in the space spanned by our original data. We can directly explain what a particular principal component does.

  3. Application to new/unseen data. $t$-SNE does not learn a function from the original space to the new (lower-dimensional) one, and that is a problem. In that respect, $t$-SNE is a non-parametric learning algorithm, so approximating it with a parametric algorithm is an ill-posed problem. The embedding is learned by directly moving the data points around the low-dimensional space. That means one does not get an eigenvector or a similar construct to use on new data. In contrast, with PCA the eigenvectors define a new axis system that can be used directly to project new data. [One could try training a deep network to learn the $t$-SNE mapping (you can hear Dr. van der Maaten at ~46' of this video suggesting something along these lines), but clearly no easy solution exists.]

  4. Incomplete data. Natively, $t$-SNE does not deal with incomplete data. In fairness, PCA does not deal with it either, but numerous extensions of PCA for incomplete data (eg. probabilistic PCA) exist and are almost standard modelling routines. $t$-SNE currently cannot handle incomplete data (aside from, obviously, training a probabilistic PCA first and passing the PC scores to $t$-SNE as inputs).

  5. The "$k$ is not (too) small" case. $t$-SNE solves a problem known as the crowding problem: somewhat similar points in the higher-dimensional space collapse on top of each other in lower dimensions (more here). Now, as you increase the number of dimensions used, the crowding problem gets less severe, ie. the problem you are trying to solve through the use of $t$-SNE gets attenuated. You can work around this issue, but it is not trivial. Therefore, if you need a $k$-dimensional vector as the reduced set and $k$ is not quite small, the optimality of the produced solution is in question. PCA, on the other hand, always offers the $k$ best linear combinations in terms of variance explained. (Thanks to @amoeba for noticing I made a mess when first trying to outline this point.)
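To make points 1 to 3 concrete, here is a minimal sketch using scikit-learn's PCA and TSNE on the bundled digits dataset (a generic stand-in; none of the data or settings come from the question):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X_train, X_new = X[:1500], X[1500:]   # treat the last rows as "unseen" data

# Point 1: stochasticity. Two t-SNE runs from different random initialisations
# typically end in different local minima (different embeddings and KL values);
# refitting PCA gives exactly the same components every time.
tsne_a = TSNE(n_components=2, init="random", random_state=0).fit(X_train)
tsne_b = TSNE(n_components=2, init="random", random_state=1).fit(X_train)
print(tsne_a.kl_divergence_, tsne_b.kl_divergence_)        # usually not equal

pca_a = PCA(n_components=2, svd_solver="full").fit(X_train)
pca_b = PCA(n_components=2, svd_solver="full").fit(X_train)
print(np.allclose(pca_a.components_, pca_b.components_))   # True: deterministic

# Point 2: interpretability. Each PCA component is a direction in the original
# feature space, so its entries can be read as weights on the original features.
print(pca_a.components_.shape)        # (2, n_features)

# Point 3: new/unseen data. PCA is an explicit linear map, so projecting new
# rows is one call; sklearn's TSNE has no transform() for out-of-sample points.
print(pca_a.transform(X_new).shape)
print(hasattr(TSNE(), "transform"))   # False
```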

I do not mention issues about computational requirements (eg. speed or memory size) or about selecting relevant hyperparameters (eg. perplexity). I consider these internal issues of the $t$-SNE methodology that are irrelevant when comparing it to another algorithm.

To summarise, $t$-SNE is great, but like all algorithms it has its limitations when it comes to applicability. I use $t$-SNE on almost any new dataset I get my hands on as an exploratory data analysis tool. I do think, though, that it has certain limitations that make it not nearly as widely applicable as PCA. Let me stress that PCA is not perfect either; for example, PCA-based visualisations are often inferior to those of $t$-SNE.