Solved – Outlier detection in out-of-sample data for the purpose of classification

classification, MATLAB, outliers

The following is one of the most widely used rules for outlier detection in econometrics and statistics. Here X is the data being searched for outliers and n is a chosen multiplier (in MATLAB):

 abs(X-mean(X)) >= n*std(X)

If the inequality holds for a sample, that sample is an outlier; otherwise the sample is kept. I'm using a neural network (MLP) and an SVM for my classification problem. Before normalizing the data, I look for outliers in each feature of the input database. Having built my MLP or SVM model, I now want to use it on out-of-sample data (new data from this year; the model was trained on last year's data).

  • Should I apply outlier detection to this new database?
  • If yes, should I use the mean and std of the previous year, which were used for outlier detection on the main data, or the new mean and std of the out-of-sample data?

Best Answer

The outlier-detection rule you defined before training your model should be applied to all new data you feed to it; otherwise your model will be confronted with data points it has not been trained to handle. It should be exactly the same rule, i.e. it should use the mean and std from the previous year.
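As a minimal sketch of what "the exact same rule" could look like in practice: compute the mean and std once on the training year, store them, and reuse them unchanged on the new data. The matrices `Xtrain` and `Xnew` and the choice `n = 3` are made-up values for illustration, not taken from the question.

```matlab
% Hypothetical data: rows are samples, columns are features.
Xtrain = [(1:10)', (10:10:100)'];   % last year's data (used to fit the rule)
Xnew   = [5 50; 20 60; 6 200];      % this year's out-of-sample data

n = 3;                              % multiplier in the rule (an assumption)

% Fit the rule on the training year only and keep these values.
mu    = mean(Xtrain, 1);            % per-feature mean
sigma = std(Xtrain, 0, 1);          % per-feature std (N-1 normalization)

% Apply the stored rule to the new data, feature by feature.
% Uses implicit expansion (R2016b or later); use bsxfun on older releases.
isOut = abs(Xnew - mu) >= n * sigma;

% Keep only the samples with no flagged feature.
keep   = ~any(isOut, 2);
Xclean = Xnew(keep, :);
```

Here the second and third rows of `Xnew` are flagged (one feature each) and only the first row is kept. If you normalize after outlier removal, the same reasoning suggests reusing the training-year normalization constants as well.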

However, if you expect a significant difference between the means of two successive years, this might flag too many data points as outliers. You can check that quickly by looking at the moving average.
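A rough sketch of such a check, for a single feature ordered in time. The series `xOld` and `xNew`, the window length `w`, and `n = 3` are invented for illustration; `movmean` needs R2016a or later.

```matlab
% Hypothetical single feature, ordered in time.
xOld = [10 11 9 10 12 10 9 11 10 10]';   % training year
xNew = xOld + 4;                         % new year, shifted upward by 4

% Moving average over both years; plot(ma) would show the level shift.
w  = 5;                                  % window length (an assumption)
ma = movmean([xOld; xNew], w);

% Shift of the yearly mean, in units of the training std.
shift = (mean(xNew) - mean(xOld)) / std(xOld);

% Fraction of new samples the old rule would flag.
n = 3;
fracFlagged = mean(abs(xNew - mean(xOld)) >= n * std(xOld));
```

In this made-up case the mean moves by more than four training standard deviations and the old rule flags every new sample, which is the symptom described above. How large a shift is acceptable is a judgment call for your data.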

Regarding the outlier detection itself, treating one feature at a time might not be optimal, especially if you use multivariate methods afterwards. Some interesting tools are available here in different implementations (MATLAB, R ...).
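To illustrate why a per-feature rule can miss things, here is a sketch using the Mahalanobis distance. This is one common multivariate option that I am picking for illustration, not necessarily one of the tools linked above. The data are made up, and the chi-square cutoff assumes roughly Gaussian features; the closed-form quantile used here is valid for two features only.

```matlab
% Hypothetical training data with two strongly correlated features.
Xtrain = [1 2; 2 4.1; 3 5.9; 4 8.2; 5 9.8; ...
          6 12.1; 7 14; 8 15.9; 9 18.2; 10 19.8];
Xnew   = [5 10; 5 16];     % second point breaks the correlation pattern

mu = mean(Xtrain, 1);      % training mean, reused for new data
S  = cov(Xtrain);          % training covariance matrix

% Squared Mahalanobis distance of each new sample to the training data.
Xc = Xnew - mu;            % implicit expansion (R2016b or later)
d2 = sum((Xc / S) .* Xc, 2);

% 97.5% chi-square quantile with 2 degrees of freedom (closed form).
thr = -2 * log(1 - 0.975);
isOutMulti = d2 > thr;

% Per-feature rule with n = 3, for comparison.
isOutUni = abs(Xc) >= 3 * std(Xtrain, 0, 1);
```

The point `[5 16]` passes the per-feature rule because each value lies well inside its own range, but it is flagged by the multivariate distance because the combination is unlike anything in the training data. As with the univariate rule, `mu` and `S` come from the training year and are reused as they are.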
