Solved – Supervised machine learning method to forecast energy demand according to multiple variables

machine learning, supervised learning

I need the best machine learning method/algorithm/technique to predict energy consumption. The given training dataset consists of 2 years of energy consumption, with entries every 15 minutes. In addition, weather data (radiation, humidity, temperature and wind speed) are given every hour.

As input, I have a few entries in exactly the same format as the training dataset, except that the power consumption is missing and has to be predicted as accurately as possible.

The dataset contains several variables: time, day of the week, week of the year, radiation, humidity, temperature, wind speed and power demand.

The goal is to be given the same variables, minus power demand, and to predict power demand.

The approach must:

  • use supervised machine learning
  • interpolate the weather data (radiation, humidity, temperature and wind speed) before training, because they are only given hourly while the consumption data are given every 15 minutes
  • divide the historic data into two sets: a training set (from which the application can learn) and a test set on which to test the accuracy of the forecasts
  • expect that not only weather but also time of day, day of week and week of the year will play an important role in the forecasting
  • not expect all variables to play an equal role; including all possible input variables may even reduce performance
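The interpolation and train/test requirements in the list above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the function names, the sample temperatures and the 25% test fraction are made up, and linear interpolation is just one reasonable choice. The split is chronological (the test set is the most recent part) so that the model is evaluated on data later than anything it was trained on.

```python
from datetime import datetime, timedelta


def interpolate_to_quarter_hours(hourly):
    """Linearly interpolate hourly (timestamp, value) readings to 15-minute steps.

    Assumes `hourly` is sorted by timestamp with exactly one hour between readings.
    """
    result = []
    for (t0, v0), (t1, v1) in zip(hourly, hourly[1:]):
        for step in range(4):  # offsets of 0, 15, 30 and 45 minutes
            fraction = step / 4
            result.append((t0 + timedelta(minutes=15 * step),
                           v0 + fraction * (v1 - v0)))
    result.append(hourly[-1])  # keep the final hourly reading
    return result


def chronological_split(rows, test_fraction=0.2):
    """Split time-ordered rows so that the test set is the most recent part."""
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


# Made-up hourly temperatures in degrees Celsius, for illustration only.
hourly_temperature = [
    (datetime(2020, 1, 1, 0), 2.0),
    (datetime(2020, 1, 1, 1), 4.0),
    (datetime(2020, 1, 1, 2), 3.0),
]
quarter_hourly = interpolate_to_quarter_hours(hourly_temperature)
train, test = chronological_split(quarter_hourly, test_fraction=0.25)

for timestamp, value in quarter_hourly:
    print(timestamp.strftime("%H:%M"), value)
```

The same idea applies to each weather column; once interpolated, the weather rows can be joined to the 15-minute consumption rows on the timestamp.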

P.S. My apologies guys, I have no experience in machine learning at all. I just need a machine learning method/programming library to forecast power consumption as accurately as possible, thanks!!

Best Answer

You may want to start by doing some exploratory analysis before you dive into making a prediction model or put this into a prediction modelling framework.

Plot the data to see whether any trends appear. It is likely that some of your explanatory variables are completely redundant. Depending on how much data you have, keeping them may cause your prediction model to overfit.

Energy consumption most likely depends on the weather, mainly temperature and humidity, although wind also plays a part. E.g. people turn on their radiators when it is cold and their AC when it is warm. Time of day is also important: when people are not at home during the day, they are probably not using as much energy in their homes.

Instead of using the time of day as a variable, you can split it into a few factor levels, e.g. night, morning, working day, evening. This helps against overfitting.
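A minimal sketch of such a time-of-day factor follows. The function name and the hour boundaries are assumptions for illustration; sensible boundaries should come from plots of the actual data.

```python
from datetime import datetime


def part_of_day(timestamp):
    """Map a timestamp to a coarse time-of-day category.

    The hour boundaries below are illustrative assumptions.
    """
    hour = timestamp.hour
    if hour < 6:
        return "night"
    if hour < 9:
        return "morning"
    if hour < 17:
        return "working day"
    if hour < 23:
        return "evening"
    return "night"


print(part_of_day(datetime(2020, 3, 2, 7, 45)))   # morning
print(part_of_day(datetime(2020, 3, 2, 23, 15)))  # night
```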

You might also want to introduce factor variables that tell whether a given day is a national holiday or not, e.g. on Christmas or during the Super Bowl energy consumption will likely spike. These big spikes/outliers are the hardest part of the data to model, so you need to put your expert knowledge of the problem into the equation to account for them.
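A holiday indicator can be as simple as a lookup against a calendar. The sketch below uses a hypothetical two-entry holiday set; the real list has to come from your own knowledge of the region the data covers.

```python
from datetime import date, datetime

# Hypothetical holiday calendar for illustration only.
HOLIDAYS = {date(2020, 1, 1), date(2020, 12, 25)}


def is_holiday(timestamp):
    """Return 1 if the timestamp falls on a listed holiday, else 0."""
    return int(timestamp.date() in HOLIDAYS)


print(is_holiday(datetime(2020, 12, 25, 18, 30)))  # 1
print(is_holiday(datetime(2020, 12, 28, 18, 30)))  # 0
```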

This is not an easy problem and usually the method that you use is not what is most important. What matters most is how you preprocess your data and how you add in your own assumptions about the situations (e.g. the holidays).

The easiest way to go is to use a linear model or a random forest. Random forests are easy to use in most languages and are fairly robust against overfitting.

A random forest also gives you something called variable importance. It shows how "important" each variable is for making predictions and may help in interpreting the results.
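As a rough sketch of what this looks like in practice, the example below fits scikit-learn's `RandomForestRegressor` and reads its `feature_importances_` attribute. The data are synthetic and entirely made up: demand is constructed to depend on temperature and hour but not on wind speed, so the wind speed importance should come out lowest. With real data the feature columns would be the interpolated weather and calendar variables.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in data: demand depends on temperature and hour, not on wind speed.
temperature = rng.uniform(-5, 30, n)
hour = rng.integers(0, 24, n)
wind_speed = rng.uniform(0, 15, n)
evening_peak = (hour >= 17) & (hour < 23)
demand = 50 + 2.0 * np.abs(temperature - 18) + 10 * evening_peak + rng.normal(0, 1, n)

feature_names = ["temperature", "hour", "wind_speed"]
X = np.column_stack([temperature, hour, wind_speed])

# These rows are independent random draws, so a plain slice is fine here.
# Real time series data should be split chronologically.
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = demand[:1500], demand[1500:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

importances = dict(zip(feature_names, model.feature_importances_))
for name, score in sorted(importances.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
print("R^2 on held-out rows:", model.score(X_test, y_test))
```

The importances sum to 1 and are relative to each other, so they are best read as a ranking rather than as effect sizes.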

Hope this helps. Just don't dump this into some model head first; think about the problem and what matters for these predictions. Also look at the residuals after you have fitted the model.
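One simple way to look at residuals is to group them by a variable such as hour of day and check whether the mean is far from zero anywhere. The numbers below are made up for illustration; a large mean residual in one group (here the evening hour) would suggest the model is missing something for that period.

```python
from collections import defaultdict
from statistics import mean

# Made-up (hour of day, actual demand, predicted demand) triples for illustration.
records = [
    (8, 120.0, 118.0),
    (8, 125.0, 119.0),
    (18, 200.0, 170.0),
    (18, 210.0, 175.0),
    (3, 60.0, 61.0),
]

residuals_by_hour = defaultdict(list)
for hour, actual, predicted in records:
    residuals_by_hour[hour].append(actual - predicted)

mean_residual = {hour: mean(values) for hour, values in residuals_by_hour.items()}
for hour in sorted(mean_residual):
    print(f"hour {hour:2d}: mean residual {mean_residual[hour]:+.1f}")
```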
