Solved – t-SNE on principal component scores: standardization needed

data-visualization, pca, standardization, tsne

I have a huge dataset (1.5 million observations and 70 features). I want to visualize the data in 2D to look for naturally occurring clusters. Analogous to van der Maaten's approach [1], I first reduce the dimensionality to 10 using PCA. Then I apply t-SNE to the dataset, where each observation is now represented as a vector of 10 PCA scores.
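For concreteness, here is a minimal sketch of this two-step pipeline in R (assuming the Rtsne package; the matrix `X` below is a small random placeholder standing in for the real data):

```r
library(Rtsne)

set.seed(42)
X <- matrix(rnorm(1000 * 70), nrow = 1000)  # placeholder for the 1.5M x 70 data

# Step 1: PCA down to 10 dimensions.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:10]

# Step 2: t-SNE on the 10 PC score columns.
# pca = FALSE skips Rtsne's own internal PCA step (see the answer below).
tsne <- Rtsne(scores, dims = 2, pca = FALSE)
plot(tsne$Y, pch = 20, cex = 0.3, xlab = "t-SNE 1", ylab = "t-SNE 2")
```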

My question is: while applying t-SNE, do I need to standardize each of the 10 score columns? MATLAB's documentation suggests: "When features in X are on different scales, set 'Standardize' to true. Do this because the learning process is based on nearest neighbors, so features with large scales can override the contribution of features with small scales."
I am not sure whether PC scores are on different scales or not. I do know that the magnitude of the PC scores tends to decrease with each successive component.

Best Answer

The PC scores are indeed inherently on different scales; the variance of each score column equals its corresponding eigenvalue, so we can read the scale of each one off directly. That said, no, do not normalise the PC scores. The suggestion you read relates to the original input data prior to PCA. Given that your current input data are already valid PC scores (i.e. created using a normalised sample as input), there is no reason to normalise them again. If anything, doing so would distort their relative importance.
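A quick numerical check of the eigenvalue point, using base R's prcomp on made-up toy data:

```r
set.seed(1)
# Toy data: 5 features with deliberately different variances.
X <- matrix(rnorm(500 * 5), nrow = 500) %*% diag(c(5, 4, 3, 2, 1))

pca <- prcomp(X, center = TRUE, scale. = TRUE)

apply(pca$x, 2, var)  # variance of each PC score column...
pca$sdev^2            # ...equals the corresponding eigenvalue
```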

Usual t-SNE implementations perform a PCA step internally to bring the dimensionality of the input data down to a reasonable number. In R, the Rtsne::Rtsne() function by default uses $50$ dimensions as a "reasonable number of dimensions"; in the 2008 and 2014 JMLR papers by van der Maaten this number is $30$. In any case, since we already provide PC scores as input, we can skip that step. Performing PCA on PC scores will result in identical outputs (up to sign).
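Both points can be verified in a few lines (a sketch on toy data; pca = FALSE is Rtsne's switch for skipping the internal PCA step):

```r
library(Rtsne)

set.seed(7)
X <- matrix(rnorm(500 * 20), nrow = 500)
scores <- prcomp(X, center = TRUE, scale. = TRUE)$x[, 1:10]

# Skip the internal PCA since the input is already PC scores.
tsne <- Rtsne(scores, dims = 2, pca = FALSE)

# PCA applied to PC scores returns the same scores, up to a sign flip
# per column:
scores2 <- prcomp(scores, center = TRUE)$x
max(abs(abs(scores) - abs(scores2)))  # ~0, i.e. identical up to sign
```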