From Modern Epidemiology 3rd Edition by Rothman, Greenland and Lash:
There are at least three forms of overmatching. The first refers to matching that harms statistical efficiency, such as case-control matching on a variable associated with exposure but not disease. The second refers to matching that harms validity, such as matching on an intermediate between exposure and disease. The third refers to matching that harms cost-efficiency.
The answer from AndyW is about the second form of overmatching. Briefly, here's how they all work:
1: One of the criteria for a covariate to be a confounder is that it be associated with both the outcome and the exposure. If it's associated with only one of them, it's not a confounder, and all you've succeeded in doing is widening your confidence interval.
To explore this type of overmatching further, consider a matched case-control study of a binary exposure, with one control matched to each case on one or more confounders. Each stratum in the analysis will consist of one case and one control unless some strata can be combined. If the case and its matched control are either both exposed or both unexposed, one margin of the 2 x 2 table will be 0 ... such a pair of subjects will not contribute any information to the analysis. If one stratifies on correlates of exposure, one will increase the chance that such tables will occur and thus tend to increase the information lost in stratified analysis.
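To make the information loss concrete, here is a toy sketch (the pair counts are invented) of a 1:1 matched analysis. Only discordant pairs, where exactly one member of the pair is exposed, contribute to the matched odds ratio; concordant pairs drop out exactly as the quote describes:

```python
# Hypothetical matched-pair case-control data: each tuple is
# (case_exposed, control_exposed). Only discordant pairs inform the
# matched odds ratio (McNemar-type estimate: OR = b / c).
pairs = [
    (1, 1), (0, 0), (1, 1),  # concordant pairs: contribute no information
    (1, 0), (1, 0), (1, 0),  # case exposed, control unexposed (b)
    (0, 1),                  # case unexposed, control exposed (c)
]

b = sum(1 for case, ctrl in pairs if case == 1 and ctrl == 0)
c = sum(1 for case, ctrl in pairs if case == 0 and ctrl == 1)
concordant = len(pairs) - b - c

print(f"discordant pairs: b={b}, c={c}; concordant (wasted): {concordant}")
print(f"matched odds ratio = b/c = {b / c:.1f}")
```

Matching on a correlate of exposure makes case and control more alike in exposure, inflating the concordant count and shrinking the effective sample size.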
2: This is partially discussed by AndyW. Matching on an intermediate factor will bias your estimate, as will matching on something affected by both the exposure and the outcome. The latter is essentially conditioning on a collider, and any technique that does so will bias your estimate.
If, however, the potential matching factor is affected by exposure and the factor in turn affects disease (i.e., is an intermediate variable), or is affected by both exposure and disease, then matching on the factor will bias both the crude and adjusted effect estimates. In these situations, case-control matching is nothing more than an irreparable form of selection bias.
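A quick simulation (my own toy example, not from the book) shows the intermediate-variable case. The exposure acts on the outcome entirely through an intermediate, so conditioning on that intermediate blocks the causal path and drags the estimated effect toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy causal chain: exposure X -> intermediate M -> outcome Y,
# with a true total effect of X on Y equal to 1.
x = rng.normal(size=n)
m = x + rng.normal(size=n)   # intermediate on the causal path
y = m + rng.normal(size=n)   # outcome depends on X only through M

# Crude estimate: regress Y on X alone; recovers the total effect (~1).
crude = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)[0][1]

# "Adjusted" estimate: conditioning on the intermediate M blocks the
# causal path and biases the X coefficient toward zero (~0).
adj = np.linalg.lstsq(np.column_stack([np.ones(n), x, m]), y, rcond=None)[0][1]

print(f"crude effect of X: {crude:.2f}")
print(f"adjusted for M:    {adj:.2f}")
```

Matching on M in the design stage has the same flavor of damage, and per the quote it cannot be repaired in the analysis.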
3: This is more of a study design problem. Extensively matching on variables that you needn't match on for reasons 1 & 2 can cause you to reject easily obtained controls (friends, family, nearby social network, etc.) in favor of far harder to obtain controls that can be matched on the unnecessary set of covariates. That costs money, money that could have been spent on more subjects or on better exposure or disease ascertainment, for no appreciable gain in validity or precision, and indeed having threatened both.
You may want to start with some exploratory analysis before you dive into building a prediction model.
Plot the data to see whether any trends appear. It is likely that some of your explanatory variables are completely redundant. Depending on how much data you have, keeping them in may cause your prediction model to overfit.
Energy consumption most likely depends on the weather, chiefly temperature and humidity (although wind also plays a part): people turn on their radiators when it is cold and the AC when it is warm. Time of day is also important, since when people are not at home during the day, they are probably not using as much energy in their homes.
Instead of using the time of day as a raw variable, you can split it into a few factors, e.g. night, morning, working day, evening. This will help guard against overfitting.
You might also want to introduce factor variables indicating whether a given day is a national holiday, since e.g. at Christmas or during the Super Bowl energy consumption will likely spike. These big spikes/outliers are the hardest part of the data to model, and you need to insert your expert knowledge of the problem into the equation to account for them.
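A minimal sketch of this kind of feature engineering; the period cut-offs and the holiday list are illustrative assumptions, not a recommendation:

```python
import datetime as dt

# Hypothetical feature engineering for an energy-consumption model:
# collapse the hour of day into a few coarse periods and flag holidays.
HOLIDAYS = {dt.date(2024, 1, 1), dt.date(2024, 12, 25)}  # made-up list

def time_of_day(hour):
    # Illustrative cut-offs; tune them to your own load curves.
    if hour < 6:
        return "night"
    if hour < 9:
        return "morning"
    if hour < 17:
        return "working_day"
    return "evening"

def features(ts):
    return {
        "time_of_day": time_of_day(ts.hour),
        "is_holiday": ts.date() in HOLIDAYS,
    }

print(features(dt.datetime(2024, 12, 25, 19, 30)))
# {'time_of_day': 'evening', 'is_holiday': True}
```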
This is not an easy problem, and usually the method you use is not what matters most. What matters most is how you preprocess your data and how you add in your own assumptions about the situation (e.g. the holidays).
The easiest way to go is a linear model or a random forest. Random forests are easy to use in most languages and are fairly robust against overfitting.
You can also get a measure from the random forest called variable importance, which shows how "important" each variable is for making predictions and may help in interpreting the results.
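For instance, with scikit-learn (assumed available here; the data below are synthetic stand-ins for real energy data), the fitted forest exposes importances directly:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2_000
temperature = rng.normal(15, 10, n)
humidity = rng.uniform(20, 90, n)
hour = rng.integers(0, 24, n)

# Synthetic consumption: strongly driven by distance from a comfortable
# temperature, weakly by hour of day, not at all by humidity.
consumption = np.abs(temperature - 18) + 0.1 * hour + rng.normal(0, 1, n)

X = np.column_stack([temperature, humidity, hour])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, consumption)

for name, imp in zip(["temperature", "humidity", "hour"], model.feature_importances_):
    print(f"{name:12s} importance: {imp:.3f}")
```

In this toy setup temperature dominates the importance ranking, which matches how the target was generated.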
Hope this helps. Just don't dump the data into some model head first; think about the problem and what matters for these predictions. Also look at the residuals after you have fitted the model.
Best Answer
Confounding plays a large role in statistics because we are looking to identify the exact effect of one set of variables on another. If confounding variables are left out of a statistical model, then the effect measured for the variables that were included may be biased.
Confounding is not as big a problem when performing prediction, because we are not concerned with identifying the exact effect of one variable on another. We are simply looking for the `most likely' value of a dependent variable given a set of predictors.
So for example, suppose that we would like to estimate to what degree a person's age affects their salary. We can estimate the model: \begin{equation} \text{salary}_i = \beta_0 + \beta_1 \text{age}_i + \varepsilon_i. \end{equation} It is very likely that $\beta_1$ in the equation above will be positive and fairly large, because older people tend to have more education and more work experience. So if we wish to pin-point the link between age and salary, we should probably control for these confounders, estimating the model: $$ \text{salary}_i = \beta_0 + \beta_1^* \text{age}_i + \beta_2 \text{education}_i + \beta_3 \text{experience}_i + \varepsilon_i. $$ It is very likely that $\beta_1^* < \beta_1$ and that $\beta_1^*$ will be a much better estimator for the pure effect of age on one's earnings, in the sense of `change someone's age and keep EVERYTHING else fixed'. However, since age is highly correlated with education and experience, the first model might be good enough for predicting a person's salary.
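A simulation makes the gap between $\beta_1$ and $\beta_1^*$ visible (all coefficients and noise levels below are made up for illustration). Age feeds salary directly and also indirectly through education and experience, so the crude slope absorbs both paths:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Made-up data-generating process: true direct age effect is 1.0;
# education and experience both rise with age and also raise salary.
age = rng.uniform(25, 65, n)
education = 0.5 * age + rng.normal(0, 2, n)
experience = age + rng.normal(0, 3, n)
salary = 1.0 * age + 2.0 * education + 1.5 * experience + rng.normal(0, 5, n)

def ols(X, y):
    # Least-squares fit with an intercept; returns all coefficients.
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta1 = ols(age, salary)[1]                                        # crude
beta1_star = ols(np.column_stack([age, education, experience]), salary)[1]

print(f"crude beta_1     = {beta1:.2f}")       # ~3.5: direct + indirect paths
print(f"adjusted beta_1* = {beta1_star:.2f}")  # ~1.0: 'pure' age effect
```

Both models predict salary well here, since age carries most of the predictive signal; only the causal reading of the age coefficient differs.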