Solved – Robust Principal Component Analysis for Anomaly Detection

anomaly-detection, dimensionality-reduction, machine-learning, mathematical-statistics, outliers

I went through the algorithm and some papers on Robust PCA, and I understand that a matrix $M$ is decomposed into a low-rank matrix $L$ and a sparse matrix $S$. If I'm correct, it is the sparse matrix $S$ that can be used to identify the outliers. I now have two questions, and any help from the Stack Exchange community would be appreciated:

Question 1: If the data is a 1-d time series, how do I convert it into a 2-d matrix? I tried two approaches. Say I have 7 days of data and each day I collect 24 data points. In the first approach, I reshaped my 1-d vector into a 2-d array of shape (# records per day, # days), i.e. (24, 7), which means the 1st column holds only the 1st day's data, the 2nd column holds only the 2nd day's data, and so on.
In the second approach, I simply reshaped my 1-d time series vector of shape (168, 1) into a 2-d matrix of shape (24, 7), but in this case the entries were spread horizontally, i.e. the first 7 entries of day 1 on row 1, entries 8-14 on row 2, entries 15-21 on row 3, and so on for each day.

However, when I applied the R-PCA algorithm, the reconstructed values $(L + S)$ from the 2-d matrix obtained with the 2nd approach almost matched the original time series (the purpose of PCP), whereas the reconstruction from the matrix reshaped the 1st way was almost random and didn't really match the values of the original matrix $M$. Any idea how and why this happened?

Question 2: How would I use the sparse matrix to determine the anomalies? I'm thinking along the lines of some distance measure, but any insights/references on this would also be welcome.

Best Answer

I went through these papers and others, and used Robust PCA for my own needs. In addition to Candès et al., you can take a look at the implementation suggested by Lin et al. (2013): https://arxiv.org/pdf/1009.5055.pdf. Besides detecting outliers, one of the problem formulations also allows you to complete missing entries. You can find a nice Python implementation of the Robust PCA problem formulation here.
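For reference, here is a minimal sketch of the principal component pursuit decomposition via an inexact augmented Lagrange multiplier scheme in the spirit of Lin et al.; the function name `robust_pca`, the default $\lambda = 1/\sqrt{\max(m, n)}$, the choice of $\mu$, and the stopping rule are my own illustrative choices, not taken verbatim from the paper:

```python
import numpy as np

def shrink(X, tau):
    # Element-wise soft-thresholding (proximal operator of the l1 norm).
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    # Singular value thresholding (proximal operator of the nuclear norm).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def robust_pca(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Sketch of inexact ALM / PCP: decompose M into low-rank L plus sparse S."""
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))              # common default (Candès et al.)
    if mu is None:
        mu = 0.25 * m * n / (np.abs(M).sum() + 1e-12)  # heuristic choice
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    norm_M = np.linalg.norm(M, 'fro')
    for _ in range(max_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)           # low-rank update
        S = shrink(M - L + Y / mu, lam / mu)        # sparse update
        residual = M - L - S
        Y = Y + mu * residual                        # dual update
        if np.linalg.norm(residual, 'fro') <= tol * norm_M:
            break
    return L, S
```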

Question 1: From what I understand, your 1-d time series is a one-day pattern repeated over one complete week, plus noise. Therefore, as you would do with classic PCA, your samples are the days of your time series and your features are the hours of the day, so you should reshape your data as a 7 x 24 matrix (n_samples x n_features); see the sketch below. Indeed, as with PCA, your samples (rows) are supposedly strongly correlated with each other, so you can find a faithful low-rank representation of your time series. Robust PCA comes in handy because it is not as strongly affected by outliers as PCA, where strong outliers might pull the main direction of variance.
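As a concrete illustration of that layout (assuming hourly data over one week, i.e. 168 points; the array `x` is just a stand-in for your series):

```python
import numpy as np

# Hypothetical 1-d hourly series: 7 days x 24 hours = 168 points.
x = np.arange(168, dtype=float)

# Rows are samples (days), columns are features (hours of the day),
# so row i holds the 24 consecutive readings of day i.
M = x.reshape(7, 24)
```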

Before applying Robust PCA to your data, you should also look at preprocessing steps, such as making your time series stationary, centering each day, and so on; a minimal centering example follows.
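A minimal sketch of the per-day centering step, assuming the 7 x 24 matrix `M` and the `robust_pca` sketch from above:

```python
# Subtract each day's mean so that every row (day) is centered around zero.
M_centered = M - M.mean(axis=1, keepdims=True)

# Decompose the centered matrix into low-rank + sparse parts.
L, S = robust_pca(M_centered)
```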

Question 2: The implementation from Lin et al. (2013) applies an adaptive threshold to filter the entries of $S$, keeping only the entries above the threshold. More details are in the paper.

This means that the work of detecting outliers is a priori done in the $S$ computation steps. Theoretically, filtering the non-zero values of the $S$ matrix should give you the outliers in your data. However, with noisy or non-smooth data, the sparse matrix $S$ may contain too many non-zero entries. If you feel you detect too many outliers, one idea would be to apply a quantile-based anomaly detection on the flattened sparse matrix: let $\sigma = q_{75} - q_{25}$ be your interquartile range, then flag all values whose distance to zero is more than $3 \times \sigma$ (see the sketch below).
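A minimal sketch of that quantile-based filter on the flattened sparse matrix (the multiplier `k=3.0` is just the heuristic above and can of course be tuned):

```python
import numpy as np

def iqr_outliers(S, k=3.0):
    # Flatten the sparse matrix and compute the interquartile range.
    s = S.ravel()
    q25, q75 = np.percentile(s, [25, 75])
    sigma = q75 - q25
    # Flag entries whose distance to zero exceeds k times the IQR.
    return np.abs(s) > k * sigma

# S is the sparse matrix from the decomposition above.
mask = iqr_outliers(S).reshape(S.shape)   # boolean matrix marking anomalies
```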
