Autoencoders – How to Prepare Data for Autoencoder Training

anomaly detection, autoencoders, data preprocessing

I have, for example, 500 engines, and for each engine I have 100 features, each of which is measured 1000 times. That means the data table has 500,000 rows and 100 columns. (For each engine I have only an ID number, no other metadata.)
I want to use an autoencoder to detect anomalies, but I do not know how to prepare my data for it. Should I rearrange the table into a 500 x 100,000 table, treat it as a 500,000 x 100 table, or use a three-dimensional array of shape (500, 1000, 100)? Can someone give me a hint on how to handle my data? Many thanks in advance!

Best Answer

First approach (questionable): You look for anomalous measurements (here, a measurement is a single row of 100 values, one for each of the 100 features F1, F2, ...), and then declare an engine as anomalous if it generated an anomalous measurement. I don't know your data, and its volatility per engine, so I cannot say how reasonable this approach is. But if you go for this approach, you would have to arrange your data as a 500,000 x 100 matrix, create an autoencoder with 100 input nodes, and train it by feeding it the rows of your matrix.
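
If you go this route, a minimal sketch could look like the following. It assumes Keras/TensorFlow and per-feature standardized inputs; the random placeholder data, layer sizes, and training hyperparameters are illustrative only and would need tuning for your engines:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder for the 500,000 x 100 measurement matrix; replace with your
# own (standardized) data.
X = np.random.randn(500_000, 100).astype("float32")

# Autoencoder with 100 input nodes, a small bottleneck, and 100 outputs.
inputs = keras.Input(shape=(100,))
encoded = layers.Dense(64, activation="relu")(inputs)
encoded = layers.Dense(32, activation="relu")(encoded)
decoded = layers.Dense(64, activation="relu")(encoded)
outputs = layers.Dense(100, activation="linear")(decoded)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=256, validation_split=0.1)

# Reconstruction error per measurement; unusually large values would flag
# potentially anomalous measurements (and hence engines).
reconstruction = autoencoder.predict(X, batch_size=256)
errors = np.mean((X - reconstruction) ** 2, axis=1)
```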

Second approach: You presume that all the measurements of an anomalous engine would be anomalous. Then, in theory, you would have to create a matrix of dimensions 500 x 100,000, with each row containing the data of a single engine. This is, however, impractical, because (1) you would have to build an autoencoder with 100,000 input nodes (that's probably too many), and (2) your dataset would be of size 500, which is usually too small for deep learning (and you would still have to split it into training and test sets).

Third approach: Presuming that all the measurements of an anomalous engine are anomalous, I would suggest another approach: first, come up with some summary statistics for the engines, and use those as new features. E.g., compute for each feature its average (avg(F1), avg(F2), ..., avg(F100)) for every single engine, so that each engine is described by 100 feature averages. This reduces the size of the data row describing an engine from 100,000 (second approach above) to just 100. (Of course, the proper choice of summary features depends on your scenario, and averages might not be the right choice.) That means you are left with a dataset given by a 500 x 100 matrix. And since 500 samples are probably not enough for a deep learning approach, you should feed this data to some classic method instead, e.g. k-nearest neighbor, isolation forest, one-class SVM, ...; see the sketch below.
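
A minimal sketch of this third approach, assuming NumPy and scikit-learn, with per-feature averages as the summary statistics and an isolation forest as the classic detector (the reshape assumes measurements of the same engine occupy consecutive rows; the contamination value is a guess to be tuned):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Placeholder for the 500,000 x 100 measurement matrix; replace with your data.
data = np.random.randn(500_000, 100)

# Collapse each engine's 1000 measurements into per-feature averages,
# yielding one 100-dimensional summary row per engine (500 x 100).
engine_summaries = data.reshape(500, 1000, 100).mean(axis=1)

# Fit a classic anomaly detector on the engine summaries.
detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(engine_summaries)  # -1 marks suspected anomalies

anomalous_engines = np.where(labels == -1)[0]
```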
