First, let's define a score.
John, Mike and Kate get the following percentages in their Maths, Science, English and Music exams:
      Maths  Science  English  Music
John     80       85       60     55
Mike     90       85       70     45
Kate     95       80       40     50
In this case there are 12 scores in total. Each score is one person's exam result in one subject, so a score here is simply the value where a row and a column intersect.
Now let's informally define a Principal Component.
Can you easily plot the data in the table above in a 2D graph? No, because there are four subjects, which means four variables (Maths, Science, English and Music):
- You could plot two subjects in the exact same way you would with $x$ and $y$ co-ordinates in a 2D graph.
- You could even plot three subjects in the same way you would plot $x$, $y$ and $z$ in a 3D graph (though this is generally bad practice, because some distortion is inevitable in the 2D representation of 3D data).
But how would you plot 4 subjects?
At the moment we have four variables, each representing just one subject. One way around this is to somehow combine the subjects into, say, just two new variables that we can then plot. This is known as multidimensional scaling.
Principal component analysis (PCA) is a form of multidimensional scaling. It is a linear transformation of the variables into a lower-dimensional space that retains the maximum amount of information (variance) about the variables. For example, this would let us look at the types of subjects each student may be more suited to.
A principal component is therefore a combination of the original variables after a linear transformation. In R, this is:
DF <- data.frame(Maths=c(80, 90, 95), Science=c(85, 85, 80),
English=c(60, 70, 40), Music=c(55, 45, 50))
prcomp(DF, scale = FALSE)
Which will give you something like this (first two Principal Components only for sake of simplicity):
                PC1         PC2
Maths    0.27795606  0.76772853
Science -0.17428077 -0.08162874
English -0.94200929  0.19632732
Music    0.07060547 -0.60447104
The first column here holds the coefficients of the linear combination that defines principal component 1, and the second column holds the coefficients for principal component 2.
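As a minimal sketch of how those coefficients can be pulled out in R: `prcomp()` stores them in the `rotation` element of its result. The names `pca` and `loadings` are only illustrative, and the signs of a whole column may come out flipped on your machine, because a principal component is only defined up to sign.

```r
# Same exam data as above
DF <- data.frame(Maths = c(80, 90, 95), Science = c(85, 85, 80),
                 English = c(60, 70, 40), Music = c(55, 45, 50))

# `scale.` is the full name of the argument written as `scale` above
pca <- prcomp(DF, scale. = FALSE)

# One row per subject, one column per principal component
loadings <- pca$rotation[, 1:2]
print(round(loadings, 2))

# Each column is a unit-length vector, and the two columns are orthogonal
print(colSums(loadings^2))
print(sum(loadings[, "PC1"] * loadings[, "PC2"]))
```

The unit-length and orthogonality checks are what make these columns a rotation of the original axes rather than an arbitrary re-weighting of the subjects.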
So what is a Principal Component Score?
It's a score from the table at the end of this post (see below).
The above output from R means we can now plot each person's score across all subjects in a 2D graph as follows. First, we need to center the original variables by subtracting column means:
      Maths  Science  English  Music
John  -8.33     1.67     3.33      5
Mike   1.67     1.67    13.33     -5
Kate   6.67    -3.33   -16.67      0
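The centering step can be sketched in R with `scale()`. The data frame is repeated so the snippet runs on its own, and `centered` is just an illustrative name.

```r
# Same exam data as above
DF <- data.frame(Maths = c(80, 90, 95), Science = c(85, 85, 80),
                 English = c(60, 70, 40), Music = c(55, 45, 50))

# Subtract each column's mean; scale = FALSE leaves the spread untouched
centered <- scale(DF, center = TRUE, scale = FALSE)
rownames(centered) <- c("John", "Mike", "Kate")
print(round(centered, 2))

# Every column of the centered data now has mean zero
print(round(colMeans(centered), 10))
```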
Then form the linear combinations (each coefficient, rounded to two decimals, multiplied by the matching centered value) to get the PC1 score (x) and the PC2 score (y):
John  x = 0.28*(-8.33) + (-0.17)*1.67 + (-0.94)*3.33 + 0.07*5
      y = 0.77*(-8.33) + (-0.08)*1.67 + 0.20*3.33 + (-0.60)*5
Mike  x = 0.28*1.67 + (-0.17)*1.67 + (-0.94)*13.33 + 0.07*(-5)
      y = 0.77*1.67 + (-0.08)*1.67 + 0.20*13.33 + (-0.60)*(-5)
Kate  x = 0.28*6.67 + (-0.17)*(-3.33) + (-0.94)*(-16.67) + 0.07*0
      y = 0.77*6.67 + (-0.08)*(-3.33) + 0.20*(-16.67) + (-0.60)*0
Which, using the unrounded coefficients, simplifies to:
x y
John -5.39 -8.90
Mike -12.74 6.78
Kate 18.13 2.12
There are six principal component scores in the table above. You can now plot the scores in a 2D graph to get a sense of the type of subjects each student is perhaps more suited to.
The same output can be obtained in R by typing prcomp(DF, scale = FALSE)$x.
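To check the hand calculation, here is a minimal sketch that rebuilds the scores as "centered data times coefficient matrix" and compares them with what `prcomp()` stores in `$x`. The names `centered` and `manual_scores` are illustrative. Signs of a whole column may differ from the table above, since each component is only defined up to sign.

```r
# Same exam data as above
DF <- data.frame(Maths = c(80, 90, 95), Science = c(85, 85, 80),
                 English = c(60, 70, 40), Music = c(55, 45, 50))

pca <- prcomp(DF, scale. = FALSE)

# Step 1: center the original variables
centered <- scale(DF, center = TRUE, scale = FALSE)

# Step 2: one matrix product forms every linear combination at once
manual_scores <- centered %*% pca$rotation
rownames(manual_scores) <- c("John", "Mike", "Kate")
print(round(manual_scores[, 1:2], 2))

# The hand-built scores agree with the ones prcomp() returns
print(all.equal(unname(manual_scores), unname(pca$x)))
```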
EDIT 1: Hmm, I probably could have thought up a better example, and there is more to it than what I've put here, but I hope you get the idea.
EDIT 2: full credit to @drpaulbrewer for his comment in improving this answer.
Best Answer
The PC scores are inherently on different scales; the scale of each one is given by its corresponding eigenvalue, which equals the variance of that component's scores. That said, no, do not normalise the PC scores. The suggestion you read relates to the original input data prior to PCA. Given that your current input data are already valid PC scores (i.e. created using a normalised sample as input), there is no reason to normalise them again. If anything, that would distort their relative importance.
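To make the link between scale and eigenvalue concrete, here is a small sketch in R. The toy exam-score data frame is only an assumed stand-in for your own data; any numeric input would do. It checks that the variance of each column of PC scores equals the corresponding eigenvalue of the covariance matrix of the input.

```r
# Illustrative input only (an assumption, not the asker's data)
DF <- data.frame(Maths = c(80, 90, 95), Science = c(85, 85, 80),
                 English = c(60, 70, 40), Music = c(55, 45, 50))

pca <- prcomp(DF, scale. = FALSE)

# Variance of each column of PC scores
score_var <- apply(pca$x, 2, var)

# Eigenvalues of the covariance matrix of the input
eig <- eigen(cov(DF))$values

# With only 3 observations, just the first two components carry variance
print(round(score_var[1:2], 2))
print(round(eig[1:2], 2))
print(round(pca$sdev[1:2]^2, 2))
```

Rescaling every column of scores to unit variance would erase exactly these differences, which is why it changes the relative importance of the components.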
Usual t-SNE implementations perform a PCA step internally to bring the dimensionality of the input data down to a reasonable number. In R, the Rtsne::Rtsne() function by default uses $50$ dimensions as a "reasonable number of dimensions"; in the 2008 and 2014 JMLR papers by van der Maaten this number is $30$. In any case, since we already provide PC scores as input, we can skip that step (in Rtsne, by passing pca = FALSE). Performing PCA on PC scores will result in identical outputs (up to sign).
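The last claim can be checked with a short sketch in base R. The toy data frame and the names `scores` and `pca_again` are assumptions made for illustration; only the first two components are kept because, with three observations, the remaining one carries no variance.

```r
# Illustrative input only (an assumption, not the asker's data)
DF <- data.frame(Maths = c(80, 90, 95), Science = c(85, 85, 80),
                 English = c(60, 70, 40), Music = c(55, 45, 50))

# First PCA: keep the two components that carry variance
pca <- prcomp(DF, scale. = FALSE)
scores <- pca$x[, 1:2]

# Second PCA, run on the scores themselves
pca_again <- prcomp(scores, scale. = FALSE)

# The new rotation is the identity matrix up to sign ...
print(round(abs(pca_again$rotation), 6))

# ... so the new scores match the old ones up to sign
print(all.equal(abs(unname(pca_again$x)), abs(unname(scores))))
```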