Solved – t-SNE on principal component scores: standardization needed

data-visualization, pca, standardization, tsne

I have a huge dataset (1.5 million observations and 70 features). I want to visualize the data in 2D to look for naturally occurring clusters. Analogous to van der Maaten's approach [1], I first reduce the dimensionality to 10 using PCA. Then I apply t-SNE to the dataset, where each observation is now represented as a vector of 10 PCA scores.
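For concreteness, here is a minimal sketch of this two-step pipeline in R (assuming the Rtsne package; the matrix `X` below is a small random placeholder standing in for the real data):

```r
library(Rtsne)

set.seed(42)
X <- matrix(rnorm(1000 * 70), nrow = 1000)  # placeholder for the 1.5M x 70 data

# Step 1: PCA down to 10 dimensions.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:10]

# Step 2: t-SNE on the 10 PC score columns.
# pca = FALSE skips Rtsne's own internal PCA step (see the answer below).
tsne <- Rtsne(scores, dims = 2, pca = FALSE)
plot(tsne$Y, pch = 20, cex = 0.3, xlab = "t-SNE 1", ylab = "t-SNE 2")
```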

My question is: while applying t-SNE, do I need to standardize each of the 10 score columns? MATLAB's documentation suggests: "When features in X are on different scales, set 'Standardize' to true. Do this because the learning process is based on nearest neighbors, so features with large scales can override the contribution of features with small scales."
I am not sure whether PC scores are on different scales or not. I do know that the magnitude of the PC scores tends to decrease with each successive component.

Best Answer

The PC scores are indeed inherently on different scales; the variance of each score column equals its corresponding eigenvalue, so we can read the scale of each one off directly. That said, no, do not normalise the PC scores. The suggestion you read relates to the original input data prior to PCA. Given that your current input data are already valid PC scores (i.e. created using a normalised sample as input), there is no reason to normalise them again. If anything, doing so would distort their relative importance.
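A quick numerical check of the eigenvalue point, using base R's prcomp on made-up toy data:

```r
set.seed(1)
# Toy data: 5 features with deliberately different variances.
X <- matrix(rnorm(500 * 5), nrow = 500) %*% diag(c(5, 4, 3, 2, 1))

pca <- prcomp(X, center = TRUE, scale. = TRUE)

apply(pca$x, 2, var)  # variance of each PC score column...
pca$sdev^2            # ...equals the corresponding eigenvalue
```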

Usual t-SNE implementations perform a PCA step internally to bring the dimensionality of the input data down to a reasonable number. In R, the Rtsne::Rtsne() function by default uses $50$ dimensions as a "reasonable number of dimensions"; in the 2008 and 2014 JMLR papers by van der Maaten this number is $30$. In any case, since we already provide PC scores as input, we can skip that step. Performing PCA on PC scores will result in identical outputs (up to sign).
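Both points can be verified in a few lines (a sketch on toy data; pca = FALSE is Rtsne's switch for skipping the internal PCA step):

```r
library(Rtsne)

set.seed(7)
X <- matrix(rnorm(500 * 20), nrow = 500)
scores <- prcomp(X, center = TRUE, scale. = TRUE)$x[, 1:10]

# Skip the internal PCA since the input is already PC scores.
tsne <- Rtsne(scores, dims = 2, pca = FALSE)

# PCA applied to PC scores returns the same scores, up to a sign flip
# per column:
scores2 <- prcomp(scores, center = TRUE)$x
max(abs(abs(scores) - abs(scores2)))  # ~0, i.e. identical up to sign
```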