Regression – Should Target Data be Normalized Along with Input Data

neural-networks, regression

I consider myself an intermediate practitioner of neural networks. I've been asked to teach a few of my colleagues some of what I know. Some of my practices may be a bit idiosyncratic, because I study and tinker on my own. I want to teach "best practices," and when I'm in doubt, I'm trying to research exactly what those are.

I'm working with the housing price data set from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow. I'm trying to predict housing prices from the other information in the data set, a pretty classic sort of student problem. The minimum possible housing price is obviously zero, and the mean housing price in this data set is around $250,000. I already have human-readable output from a linear regression model which reports its results in dollars. I would like to preserve that in subsequent models, so we can also do things like compare mean squared errors between architectures.

Knowing my data, I would be inclined to do one or both of the following:

  • On the output layer, which is a single node with linear activation,
    set the initial bias to the mean value of 250,000.

  • Add an exponential activation function on the output node, to make
    negative prices impossible (see the Keras sketch after this list).

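Here is roughly what I had in mind, as a minimal Keras sketch (my own illustration, not the exact model I used; `n_features` and the 30-unit hidden layers are placeholders):

    import numpy as np
    from tensorflow import keras

    MEAN_PRICE = 250_000.0  # rough mean of the target prices, in dollars
    n_features = 8          # placeholder: number of input columns

    # Option 1: linear output node whose bias starts at the target mean.
    linear_head = keras.layers.Dense(
        1,
        activation="linear",
        bias_initializer=keras.initializers.Constant(MEAN_PRICE),
    )

    # Option 2: exponential output node, so negative prices are impossible.
    # To start near the mean, the bias must be log(mean), not the mean itself.
    exp_head = keras.layers.Dense(
        1,
        activation="exponential",
        bias_initializer=keras.initializers.Constant(np.log(MEAN_PRICE)),
    )

    model = keras.Sequential([
        keras.layers.Dense(30, activation="relu", input_shape=(n_features,)),
        keras.layers.Dense(30, activation="relu"),
        exp_head,  # or linear_head
    ])
    model.compile(loss="mse", optimizer="adam")
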
I did a combination of these two things, and designed some demonstration models with very few non-linear ReLU elements which outperform the linear regression after only a few dozen epochs. That's what I wanted to demonstrate.

However, I've never seen any published examples of people doing this kind of hand-tinkering. So I tried leaving out my customizations and repeated the training process. After 2,000 epochs, the network was still slowly working the outputs up from zero towards the mean value of 250,000, and the training errors were still stupidly high. I was using Adam for gradient descent, and I thought its momentum feature would have found the mean much more quickly than this.

I think I can also improve my training process by training on normalized targets. If the output targets have a mean near zero and a standard deviation near 1, the typical weight initialization schemes should find it easy to start training on actual features in the data immediately, rather than wasting epochs simply grinding towards the mean.
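
In code, I'm picturing something like this (just a sketch; `y_train` and `y_valid` are NumPy arrays of prices in dollars, and the `*_scaled` feature matrices are whatever preprocessed inputs the model already uses):

    from sklearn.preprocessing import StandardScaler

    # Fit the target scaler on the training targets only, then reuse it.
    target_scaler = StandardScaler()
    y_train_scaled = target_scaler.fit_transform(y_train.reshape(-1, 1))
    y_valid_scaled = target_scaler.transform(y_valid.reshape(-1, 1))

    # Train against the standardized targets (mean ~0, standard deviation ~1).
    model.fit(X_train_scaled, y_train_scaled,
              validation_data=(X_valid_scaled, y_valid_scaled),
              epochs=50)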

If I train on normalized targets, the mean squared error reported internally during neural-network training will no longer be comparable to the MSE from the linear regression. I will also need to apply the inverse of the normalization to the network's output to produce a human-readable result in dollars. Both of those tradeoffs are acceptable to me, but I'll have to walk my students through that additional layer of complexity.
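
That extra step would look roughly like this (again just a sketch, reusing the hypothetical `target_scaler` from above):

    from sklearn.metrics import mean_squared_error

    # Predictions come out in standardized units ...
    y_pred_scaled = model.predict(X_test_scaled)

    # ... so invert the target scaling to get back to dollars.
    y_pred_dollars = target_scaler.inverse_transform(y_pred_scaled)

    # The MSE (or RMSE) is now in dollar units again and can be compared
    # directly with the linear-regression baseline.
    mse_dollars = mean_squared_error(y_test, y_pred_dollars.ravel())
    print(f"RMSE: ${mse_dollars ** 0.5:,.0f}")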

What's the "right" way? Thanks for your advice.

Best Answer

You are right: normalising the data has the weight-initialisation advantage you mentioned, but there is another benefit. If you have multiple features at different scales, it is much harder for the weights to adjust to all of those scales. This is simply an optimisation issue.
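
For example, something like this keeps every feature on a comparable scale (a sketch; `X_train` and `X_valid` stand for the raw feature matrices):

    from sklearn.preprocessing import StandardScaler

    # Fit on the training features only, then apply the same transform to
    # the validation/test features so all columns end up on similar scales.
    feature_scaler = StandardScaler()
    X_train_scaled = feature_scaler.fit_transform(X_train)
    X_valid_scaled = feature_scaler.transform(X_valid)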

ReLU already makes the output non-negative, so I think that part is already handled. What you mentioned about setting the bias to the mean is precisely what normalisation attempts to solve. When we normalise, the inputs essentially represent deviations, and it is the deviations that cause changes to the output. If some input is 0, i.e. it equals the mean, then the network has no use for the weights in the first layer and only carries the biases forward. This calculation is expected to map onto the output mean (though not always). So the biases will incorporate the mean, and the weights will model how deviations of the inputs from their means affect the output.

If you want to compare, convert the outputs back to the original scale and recalculate the MSE. Normalising and then converting back is standard practice.
