Solved – Are there cases where PCA is more suitable than t-SNE?


I want to see how 7 measures of text-correction behaviour (time spent correcting the text, number of keystrokes, etc.) relate to each other. The measures are correlated. I ran a PCA to see how the measures projected onto PC1 and PC2, which avoided running many overlapping pairwise correlation tests between the measures.

I was asked why I did not use t-SNE instead, since the relationship between some of the measures might be non-linear.

I can see how allowing for non-linearity could help, but is there any good reason to use PCA in this case rather than t-SNE? I'm not interested in clustering the texts according to their relationship to the measures, but rather in the relationships between the measures themselves.
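For context, the setup is roughly the sketch below (illustrative only: the measure names are made up and synthetic data stands in for the real texts): standardise the 7 measures, fit a 2-component PCA, and read the loadings to see how each measure relates to PC1 and PC2.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 200 texts, 7 correlated correction
# measures with hypothetical names (replace with the actual DataFrame).
latent = rng.normal(size=(200, 2))
raw = latent @ rng.normal(size=(2, 7)) + 0.3 * rng.normal(size=(200, 7))
measures = ["time_correcting", "keystrokes", "insertions", "deletions",
            "pauses", "backspaces", "cursor_moves"]
df = pd.DataFrame(raw, columns=measures)

X = StandardScaler().fit_transform(df)        # standardise before PCA
pca = PCA(n_components=2).fit(X)

# Loadings: how each measure projects onto PC1 and PC2.
loadings = pd.DataFrame(pca.components_.T, index=measures, columns=["PC1", "PC2"])
print(loadings)
print(pca.explained_variance_ratio_)          # variance explained by each PC
```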

(I guess EFA could also be a better/another approach, but that's a different discussion.)
Compared to other methods, there are few posts on here about t-SNE, so the question seems worth asking.

Best Answer

$t$-SNE is a great piece of machine learning, but one can find many reasons to use PCA instead of it. Like most other computational methodologies in use, $t$-SNE is no silver bullet, and there are quite a few reasons that make it a suboptimal choice in some cases. Off the top of my head, let me mention five points in brief:

  1. Stochasticity of the final solution. PCA is deterministic; $t$-SNE is not. One gets a nice visualisation, then her colleague gets another visualisation, and then the debate starts about which looks better and whether a difference of $0.03\%$ in the $KL(P||Q)$ divergence is meaningful... In PCA the correct answer to the question posed is guaranteed. $t$-SNE's cost function has multiple local minima that can lead to different solutions. This necessitates multiple runs and raises questions about the reproducibility of the results. (The sketch after this list illustrates points 1 to 3.)

  2. Interpretability of the mapping. This relates to the point above, but let's assume that a team has agreed on a particular random seed/run. Now the question becomes what the embedding shows... $t$-SNE tries to map only local neighbourhoods correctly, so any insights drawn from that embedding should be treated very cautiously; global trends are not accurately represented (which can actually be a great thing for visualisation). On the other hand, PCA is just a rotation that diagonalises our initial covariance matrix, and the eigenvectors represent a new axial system in the space spanned by our original data. We can directly explain what a particular principal component does.

  3. Application to new/unseen data. $t$-SNE does not learn a function from the original space to the new (lower-dimensional) one, and that is a problem. In that respect, $t$-SNE is a non-parametric learning algorithm, so approximating it with a parametric algorithm is an ill-posed problem. The embedding is learned by directly moving the data points around the low-dimensional space. That means one does not get an eigenvector or a similar construct to use on new data. In contrast, with PCA the eigenvectors define a new axis system that can be used directly to project new data. [One could try training a deep network to learn the $t$-SNE mapping (you can hear Dr. van der Maaten at ~46' of this video suggesting something along these lines), but clearly no easy solution exists.]

  4. Incomplete data. Natively, $t$-SNE does not deal with incomplete data. In fairness, PCA does not deal with it either, but numerous extensions of PCA for incomplete data (eg. probabilistic PCA) exist and are almost standard modelling routines. $t$-SNE currently cannot handle incomplete data (aside from, obviously, training a probabilistic PCA first and passing the PC scores to $t$-SNE as inputs).

  5. The "$k$ is not (too) small" case. $t$-SNE solves a problem known as the crowding problem: somewhat similar points in the higher-dimensional space collapse on top of each other in lower dimensions (more here). Now, as you increase the number of dimensions used, the crowding problem gets less severe, ie. the problem you are trying to solve through the use of $t$-SNE gets attenuated. You can work around this issue, but it is not trivial. Therefore, if you need a $k$-dimensional vector as the reduced set and $k$ is not quite small, the optimality of the produced solution is in question. PCA, on the other hand, always offers the $k$ best linear combinations in terms of variance explained. (Thanks to @amoeba for noticing I made a mess when first trying to outline this point.)
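To make points 1 to 3 concrete, here is a minimal sketch using scikit-learn's PCA and TSNE on the bundled digits dataset (a generic stand-in; none of the data or settings come from the question):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X_train, X_new = X[:1500], X[1500:]   # treat the last rows as "unseen" data

# Point 1: stochasticity. Two t-SNE runs from different random initialisations
# typically end in different local minima (different embeddings and KL values);
# refitting PCA gives exactly the same components every time.
tsne_a = TSNE(n_components=2, init="random", random_state=0).fit(X_train)
tsne_b = TSNE(n_components=2, init="random", random_state=1).fit(X_train)
print(tsne_a.kl_divergence_, tsne_b.kl_divergence_)        # usually not equal

pca_a = PCA(n_components=2, svd_solver="full").fit(X_train)
pca_b = PCA(n_components=2, svd_solver="full").fit(X_train)
print(np.allclose(pca_a.components_, pca_b.components_))   # True: deterministic

# Point 2: interpretability. Each PCA component is a direction in the original
# feature space, so its entries can be read as weights on the original features.
print(pca_a.components_.shape)        # (2, n_features)

# Point 3: new/unseen data. PCA is an explicit linear map, so projecting new
# rows is one call; sklearn's TSNE has no transform() for out-of-sample points.
print(pca_a.transform(X_new).shape)
print(hasattr(TSNE(), "transform"))   # False
```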

I do not mention issues about computational requirements (eg. speed or memory size) or about selecting relevant hyperparameters (eg. perplexity). I consider these internal issues of the $t$-SNE methodology that are irrelevant when comparing it to another algorithm.

To summarise, $t$-SNE is great, but like all algorithms it has its limitations when it comes to applicability. I use $t$-SNE on almost any new dataset I get my hands on as an exploratory data analysis tool. I do think, though, that it has certain limitations that make it not nearly as widely applicable as PCA. Let me stress that PCA is not perfect either; for example, PCA-based visualisations are often inferior to those of $t$-SNE.