Solved – PCA in case of multivariate time series

dimensionality reduction, pca, time series

I have a dataset that contains multivariate time series and I would like to apply PCA to it, but I'm not sure how to do it.

Consider the following scenario: you monitor N parameters of some machine over a period of time (M samples) and save this test as one observation. You have Q such tests and would like to learn something about the behaviour of these machines. But you want to reduce dimensionality first, since there are Q time series, each with N parameters and M samples per parameter.

How should I do PCA in this case?

My first approach:

  • Take one test (N parameters, M samples)
  • Create samples in the following way

    (p1t1, p2t1, … pNt1)

    (p1t2, p2t2, … pNt2)

    .
    .
    .

    (p1tM, p2tM, … pNtM)

  • Perform PCA on such dataset

And this resulted in getting principal components which basically gave me an answer which of the N parameters are the most important.
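The steps above can be sketched with NumPy as follows. This is only a minimal illustration, assuming one test is stored as an (M, N) array with one row per time step and one column per parameter; the function name `pca_single_test`, the sizes, and the random data are hypothetical stand-ins for real measurements.

```python
import numpy as np

def pca_single_test(X, n_components=2):
    """PCA on one test. X has shape (M, N): rows are time steps, columns are parameters."""
    # Center each parameter (column) around its mean over time.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal directions.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Project every time step onto the retained directions.
    scores = X_centered @ components.T
    # Fraction of total variance captured by each retained component.
    explained_ratio = (S ** 2 / np.sum(S ** 2))[:n_components]
    return components, scores, explained_ratio

# Hypothetical example: M = 200 time steps, N = 5 parameters of random data.
rng = np.random.default_rng(0)
M, N = 200, 5
X = rng.normal(size=(M, N))
components, scores, explained_ratio = pca_single_test(X, n_components=2)
```

Each row of `components` is a weight vector over all N parameters, and `scores` is the reduced (M, 2) time series for this test. If the parameters are measured in very different units, standardizing the columns before the SVD is probably worth considering.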

But what confuses me is that I've done this for only a single test. How do I do it for the whole dataset? Would it make sense to extend my approach by adding the samples from the other time series in the dataset, and then run PCA on all of that?

Another question: is it a good approach for time series to split the series values into independent samples, since they may depend on time? I read about time series stationarity, but that seems relevant to domains such as stock prices, where there are trends and the like. These are just measurements of a machine working over some period of time. Is this approach OK for that?

Any help is greatly appreciated, I'm very confused at the moment.

Best Answer

I think it does make sense to use the approach that you mention and append the points from the other tests at the end. This is because PCA looks for correlation between columns. So if there is a correlation at any given time, this correlation should also hold for other times and other tests.

So your data matrix would look like:

(p1t1, p2t1, ... pNt1)q1

(p1t2, p2t2, ... pNt2)q1

. . .

(p1tM, p2tM, ... pNtM)q1

(p1t1, p2t1, ... pNt1)q2

(p1t2, p2t2, ... pNt2)q2

. . .

(p1tM, p2tM, ... pNtM)q2

. . .

where q1 refers to test 1 and q2 refers to test 2.
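The stacked layout above can be sketched with NumPy as follows. This is a rough illustration, not a definitive recipe: it assumes all Q tests have the same length M and are held in one (Q, M, N) array, and the helper names `stack_tests`, `fit_pca`, and `project`, as well as the random data, are hypothetical.

```python
import numpy as np

def stack_tests(data):
    """Reshape (Q, M, N) into (Q*M, N): all rows of test 1, then test 2, and so on."""
    Q, M, N = data.shape
    return data.reshape(Q * M, N)

def fit_pca(X, n_components):
    """Return the column means and the top principal directions of X."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def project(X, mean, components):
    """Map rows of X (time steps) onto the retained principal directions."""
    return (X - mean) @ components.T

# Hypothetical sizes: Q = 4 tests, M = 100 samples, N = 6 parameters.
rng = np.random.default_rng(1)
Q, M, N = 4, 100, 6
data = rng.normal(size=(Q, M, N))

stacked = stack_tests(data)                      # shape (Q*M, N)
mean, components = fit_pca(stacked, n_components=2)

# Project each test separately so its reduced series keeps its own time order.
reduced = np.stack([project(data[q], mean, components) for q in range(Q)])
```

Here `reduced` has shape (Q, M, 2): every test is still a time series, but with 2 derived coordinates per time step instead of N parameters. Fitting once on the stacked matrix means all tests share the same directions, which is what makes their reduced series comparable.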

There is also a question on this site about interpreting the meaning of PCA on time series data.

EDIT: I haven't looked at this in a while, but I guess it should be mentioned: why do you want to run PCA on this data? You mention

"And this resulted in getting principal components which basically gave me an answer which of the N parameters are the most important."

but PCA does not do this. After running PCA you do not get, as an output, which of the N parameters are most important. It is possible to work this out with some investigation, but if you want to know which parameters are most important (and it is not exactly clear what that means either), there are better approaches than running PCA. It really depends on what your goal is.
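To illustrate the point, here is a small sketch of what PCA actually returns. The data is synthetic and purely hypothetical: two parameters share a common driver and a third is independent noise. The output is a set of directions, each a weighted mix of all parameters, rather than a ranking of parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
n_rows = 500

# Hypothetical data: parameters 1 and 2 follow the same latent signal,
# parameter 3 is independent noise with smaller variance.
latent = rng.normal(size=n_rows)
X = np.column_stack([
    latent + 0.1 * rng.normal(size=n_rows),
    latent + 0.1 * rng.normal(size=n_rows),
    0.5 * rng.normal(size=n_rows),
])

X_centered = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

loadings = Vt                        # row k holds the weights of component k
explained = S ** 2 / np.sum(S ** 2)  # variance fraction per component

for k in range(loadings.shape[0]):
    print(f"component {k + 1}: variance share {explained[k]:.2f}, "
          f"weights {np.round(loadings[k], 2)}")
```

In this setup the first component puts comparable weight on parameters 1 and 2 together, so it describes their shared variation rather than singling out one of them. Reading the weights like this is the kind of investigation mentioned above, and it is one reason why a direct answer to "which parameter matters most" usually needs a different method.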
