Solved – How to transform time series data so I can use simpler techniques for fault prediction

data transformation, machine learning, time series

I know this is primarily a statistics site, so if I am off-topic, please redirect me.

I have a system with pumps that sometimes break and need to be replaced. I would like to be able to predict the failures, and thereby give early warning to the people replacing the pumps. I have historical data for the pump process, such as flow, pressure, liquid height etc.

I have only a small amount of experience in using machine learning techniques to classify data (basically, I have followed and done the exercises of Andrew Ng's machine learning course on Coursera, as well as Andrew Conway's Statistics One), and I have never used machine learning to classify time series. I am thinking of ways I can transform my problem so that I can use my existing knowledge on it. With my limited knowledge, I will not get an optimal prediction, but I hope to learn from this, and for this problem any small improvement in prediction is useful compared with just waiting for the faults to occur.

My proposed approach is to turn the time series into a normal classification problem. The input would be a summary of a time series window, with mean value, standard deviation, max values etc. for each type of data in the window. For the output, I am not sure what would work best. One approach is that the output would be a binary classification of whether the pump failed within a certain time period from the end of the window or not. Another is that the output would be the time left before the pump fails, so not a classification, but a regression (in the machine learning sense) instead.
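A minimal sketch of the windowing idea described above, using only the Python standard library. The function names (`summarize_window`, `make_examples`), the window and horizon lengths, and the toy pressure readings are all hypothetical and chosen for illustration; real data would have several signals (flow, pressure, liquid height) and real failure timestamps. It produces both candidate outputs: a binary "fails within the horizon" label and the time left before failure.

```python
from statistics import mean, pstdev

def summarize_window(values):
    """Collapse one window of readings into summary features."""
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
    }

def make_examples(series, failure_index, window, horizon):
    """Slide a window over one signal and label each window.

    series        : list of readings, one per time step
    failure_index : time step at which the pump failed, or None if no
                    failure was observed (assumed input, for illustration)
    window        : number of readings per window
    horizon       : look-ahead period (in time steps) for the binary label
    Returns a list of (features, fails_soon, time_to_failure) tuples.
    """
    examples = []
    for start in range(len(series) - window + 1):
        last = start + window - 1  # index of the last reading in the window
        features = summarize_window(series[start:last + 1])
        if failure_index is not None:
            time_to_failure = failure_index - last
            if time_to_failure <= 0:
                break  # the window would overlap the failure itself
            fails_soon = int(time_to_failure <= horizon)
        else:
            time_to_failure = None  # failure time is unknown
            if last + horizon > len(series) - 1:
                break  # not enough follow-up to know the binary label
            fails_soon = 0
        examples.append((features, fails_soon, time_to_failure))
    return examples

# Toy pressure readings for one pump that fails at time step 10 (made-up data).
pressure = [5.0, 5.1, 4.9, 5.0, 5.2, 5.6, 6.1, 6.8, 7.5, 8.3]
examples = make_examples(pressure, failure_index=10, window=4, horizon=3)
for features, fails_soon, time_to_failure in examples:
    print(round(features["mean"], 2), fails_soon, time_to_failure)
```

Note that for a pump with no observed failure the regression target is simply unknown, and the last few windows cannot be given a binary label either; the sketch drops them rather than guessing.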

Do you think this approach is likely to yield results? Or is it a question of "depends on the domain and historical data"? Are there better transforms (of both input and output) that I haven't considered, or is fault prediction based on time series data so different from more standard fault prediction that my time would be better spent reading up on machine learning with time series?

Best Answer

You may want to look at survival analysis, with which you can estimate the survival function (the probability that the time of failure is greater than a specific time) and the hazard function (the instantaneous rate at which a unit fails, given it has not experienced failure so far). With most survival analysis approaches you can enter time-invariant and time-varying predictors.
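For reference, the two quantities can be written down as follows. The notation is an assumption added here, not part of the original answer: $T$ is the failure time of a pump and $f$ is its probability density.

```latex
S(t) = P(T > t), \qquad
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
     = \frac{f(t)}{S(t)}, \qquad
S(t) = \exp\!\left(-\int_0^t h(u)\,du\right)
```

The last identity shows that knowing the hazard is enough to recover the survival function, which is why models are often specified in terms of the hazard.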

There are a variety of different survival analysis approaches, including the semi-parametric Cox proportional hazards model (a.k.a. Cox regression) and parametric models. Cox regression doesn't require you to specify the underlying baseline hazard function, but you might find that you need a parametric model to properly capture the failure patterns in your data, for example when the rate of failure increases over time. Sometimes parametric accelerated failure time models are appropriate, in which the predictors speed up or slow down the time to failure.
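In their usual textbook forms, the two model families look like this. The symbols are assumptions for illustration: $x$ is the vector of predictors (for example window summaries of flow and pressure), $\beta$ the coefficients, $h_0$ the unspecified baseline (base) hazard, and $\varepsilon$ an error term whose distribution determines the parametric family.

```latex
\text{Cox proportional hazards:} \quad h(t \mid x) = h_0(t)\,\exp(\beta^\top x)

\text{Accelerated failure time:} \quad \log T = \beta^\top x + \sigma\,\varepsilon
```

In the Cox model the predictors scale the hazard up or down while $h_0(t)$ is left unspecified; in the accelerated failure time model they stretch or compress the time scale itself.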

You might try starting with Cox regression since it is the simplest to use and check how well you can predict failure on a holdout test set. I suspect you may have better results with some sort of survival analysis that explicitly takes into account time and censoring (pumps that have not failed yet) than with trying to turn this into a non-time-based classification problem.
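To make the role of censoring concrete, here is a small sketch of the Kaplan-Meier estimator of the survival function in plain Python. Kaplan-Meier is not mentioned in the answer above; it is swapped in here because it is the standard nonparametric survival estimate and is short enough to write by hand, whereas Cox regression would normally be fitted with a statistics library. The function name and the service times are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Estimate the survival function from possibly censored data.

    durations : time each pump was in service, either until it failed
                or until it was last observed
    observed  : True if the pump failed at that time, False if it was
                still running (censored)
    Returns [(t, S(t))] at each distinct observed failure time.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Pumps still in service just before time t, censored ones included.
        at_risk = sum(1 for d in durations if d >= t)
        failures = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - failures / at_risk
        curve.append((t, survival))
    return curve

# Made-up service times in days for six pumps; three have not failed yet.
durations = [120, 200, 200, 310, 400, 400]
observed = [True, True, False, True, False, False]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(t, round(s, 3))
```

The censored pumps never count as failures, but they do count in the at-risk denominator for as long as they were observed. A plain classifier trained only on failed pumps would throw that information away.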
