Solved – Why do we need to take the transpose of the data for PCA?

pca

I have the following dataset: we measured the temperature 1000 times at 9 different stations across the country. The data are stored in a matrix with 9 rows and 1000 columns. I wrote my own implementation of PCA and I have to reduce the dimensionality to 3. It works, but there are some things I do not understand.

First of all, some terminology. On Wikipedia I read that the terms variable and observation are usually used. In my case, would the observations be the temperature values and the variables the 9 stations?

Why do I have to take the transpose of the matrix, obtaining a $1000\times 9$ matrix, before doing the PCA?

Basically, is what I need to do to try to preserve the information in the original dataset while using the temperature values of only $3$ out of the $9$ stations?

Best Answer

We do not need to.

It is a common and long-standing convention in statistics that data matrices have observations in rows and variables in columns. In your case, you indeed have $1000$ observations of $9$ variables, so it would be standard to organize your data in a $1000\times 9$ matrix. Most standard PCA implementations expect such an input.
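For illustration, here is a minimal NumPy sketch of PCA under this standard layout (observations in rows). The array names and the random example data are placeholders, not your actual stations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9))       # 1000 observations (rows) x 9 variables (columns)

Xc = X - X.mean(axis=0)              # center each variable (column)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3
loadings = Vt[:k].T                  # 9 x 3: principal directions, one column per component
scores = Xc @ loadings               # 1000 x 3: data projected onto the first 3 components
explained = S[:k]**2 / np.sum(S**2)  # fraction of variance captured by each kept component
```

Note that the result of the reduction is a $1000\times 3$ matrix of component scores, i.e. 3 new derived variables per observation, not the raw values of 3 of the 9 stations.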

For example, the pca() function in Matlab says this on its help page:

coeff = pca(X) returns the principal component coefficients, also known as loadings, for the $n$-by-$p$ data matrix X. Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is $p$-by-$p$.

But if you write your own code for PCA, you are free to follow the opposite convention and store variables in rows. I have often done it this way myself.
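As a sketch of what that looks like in practice (again with placeholder NumPy arrays), you can either transpose once and reuse the standard routine, or keep the variables-in-rows layout and swap the axes in the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(9, 1000))             # variables in rows: 9 stations x 1000 measurements

# Option 1: transpose once, then apply any rows-as-observations PCA routine to X.
X = D.T                                    # 1000 x 9

# Option 2: keep rows as variables and adapt the axes instead of transposing.
Dc = D - D.mean(axis=1, keepdims=True)     # center each station (row)
C = Dc @ Dc.T / (D.shape[1] - 1)           # 9 x 9 covariance matrix between stations
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1][:3]      # indices of the 3 largest eigenvalues
W = eigvecs[:, order]                      # 9 x 3 loadings
scores = W.T @ Dc                          # 3 x 1000: reduced representation
```

Both options give the same components; the only difference is which axis plays the role of "observations" in your own code.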