Regression – Calculating Predicted Value After Feature Normalization in Multiple Regression

machine learning, multidimensional scaling, multiple regression, regression

I'm currently taking the Andrew Ng machine learning course on Coursera, and in Week 2 he discusses feature scaling.

I have seen the lecture and read many posts; I understand the reasoning behind feature scaling (basically to make gradient descent converge faster by representing all the features on roughly the same scale).
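As I understand it, the normalization in question is the usual z-score scaling: each feature column $x_j$ is replaced by

$$x_j := \frac{x_j - \mu_j}{\sigma_j},$$

where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of that feature over the training set.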

My problem arises when I try to do it. I'm using Octave, and I have the code for gradient descent with linear regression set up: it computes the theta vector for the hypothesis just fine on non-scaled values, giving accurate predictions.

When I use scaled values of the input matrix X and the output vector Y, the values of theta and of the cost function J(theta) that I get are different from the ones computed on the un-scaled values. Is this normal? And how do I 'undo' the scaling, so that when I test my hypothesis on real data I get accurate predictions?

For reference, here is the scaling function I am using (in Octave):

function [scaledX, avgX, stdX] = feature_scale(X)
    is_first_column_ones = 0;            % flag: does X start with a bias column of ones?
    if sum(X(:,1) == 1) == size(X, 1)    % the first column is all ones
        is_first_column_ones = 1;
        X = X(:, 2:end);                 % strip away the bias column
    end

    stdX = std(X);     % per-column standard deviations
    avgX = mean(X);    % per-column means

    scaledX = (X - avgX) ./ stdX;        % z-score each column (uses broadcasting)

    if is_first_column_ones
        % add back the column of ones; the bias feature requires no scaling
        scaledX = [ones(size(X, 1), 1), scaledX];
    end
end

Do I scale my test input, scale my theta, or both?

I should also note that I'm calling the scaling function like this:

scaledX=feature_scale(X);
scaledY=feature_scale(Y);

where X and Y are my input and output respectively. Each column of X represents a different feature (the first column is always 1, for the bias term theta0) and each row of X represents a different training example. Y is a column vector where each row is the output corresponding to the same row of X.

e.g. X = [1, x, x^2]:

 1.00000    18.78152   352.74566
 1.00000     0.61030     0.37246
 1.00000    21.41895   458.77124
 1.00000     3.83865    14.73521

Y =

  99.8043
   1.8283
 168.9060
 -29.0058

^ this is data for the function y = x^2 - 14x + 10
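For reference, an X and Y like the ones above can be built with something like:

x = [18.78152; 0.61030; 21.41895; 3.83865];    % sample input points
X = [ones(size(x)), x, x.^2];                  % columns: bias term, x, x^2
Y = x.^2 - 14*x + 10;                          % y = x^2 - 14x + 10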

Best Answer

You should perform feature normalization only on the features, i.e. only on your input $x$ (excluding the bias column of ones), not on the output $y$ and not on $\theta$. Once you have trained a model with feature normalization, you must apply that same normalization, using the means and standard deviations computed from the training set, every time you make a prediction. It is also expected that $\theta$ and the cost function $J(\theta)$ come out different with and without normalization, because the parameters now act on rescaled features. There is no need to ever undo feature scaling.
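To make this concrete, here is a rough Octave sketch that reuses the feature_scale function from the question. For brevity it fits $\theta$ with Octave's least-squares backslash operator instead of gradient descent; the normalization logic is the same either way, and the important part is that the training-set mu and sigma are saved and reapplied to every new input:

% Training data from the question (y = x^2 - 14x + 10).
x = [18.78152; 0.61030; 21.41895; 3.83865];
X = [ones(size(x)), x, x.^2];
Y = x.^2 - 14*x + 10;

% Normalize ONLY the feature columns; leave Y alone.
% mu and sigma are now part of the model and must be kept.
[Xscaled, mu, sigma] = feature_scale(X);

% Fit theta on the scaled features (least squares here for brevity;
% gradient descent on Xscaled converges to roughly the same theta).
theta = Xscaled \ Y;

% Predict for a new raw input, e.g. x = 5:
x_new        = [5, 5^2];                 % raw features, without the bias column
x_new_scaled = (x_new - mu) ./ sigma;    % apply the SAME training mu and sigma
y_pred       = [1, x_new_scaled] * theta % close to 5^2 - 14*5 + 10 = -35

Note that the bias column of ones goes back in after scaling, and that nothing is ever 'un-scaled': the model simply expects normalized inputs from now on.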