Regression – Calculating Predicted Value After Feature Normalization in Multiple Regression

machine learning, multidimensional scaling, multiple regression, regression

I'm currently taking the Andrew Ng machine learning course on Coursera, and in Week 2 he discusses feature scaling.

I have seen the lecture and read many posts; I understand the reasoning behind feature scaling (basically to make gradient descent converge faster by representing all the features on roughly the same scale).
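As I understand it, the normalization in question is the usual z-score scaling: each feature column $x_j$ is replaced by

$$x_j := \frac{x_j - \mu_j}{\sigma_j},$$

where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of that feature over the training set.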

My problem arises when I try to do it. I'm using Octave, and I have the code for gradient descent with linear regression set up: it computes the theta vector for the hypothesis just fine on non-scaled values, giving accurate predictions.

When I use scaled values of the input matrix X and the output vector Y, the values of theta and of the cost function J(theta) that I get are different from the ones computed on the un-scaled values. Is this normal? And how do I 'undo' the scaling, so that when I test my hypothesis on real data I get accurate predictions?

For reference, here is the scaling function I am using (in Octave):

function [scaledX, avgX, stdX] = feature_scale(X)
    is_first_column_ones = 0;            % flag: does X start with a bias column of ones?
    if sum(X(:,1) == 1) == size(X, 1)    % the first column is all ones
        is_first_column_ones = 1;
        X = X(:, 2:end);                 % strip away the bias column
    end

    stdX = std(X);     % per-column standard deviations
    avgX = mean(X);    % per-column means

    scaledX = (X - avgX) ./ stdX;        % z-score each column (uses broadcasting)

    if is_first_column_ones
        % add back the column of ones; the bias feature requires no scaling
        scaledX = [ones(size(X, 1), 1), scaledX];
    end
end

Do I scale my test input, scale my theta, or both?

I should also note that I'm calling the scaling function like this:

scaledX=feature_scale(X);
scaledY=feature_scale(Y);

where X and Y are my input and output respectively. Each column of X represents a different feature (the first column is always 1, for the bias term theta0) and each row of X represents a different training example. Y is a column vector where each row is the output corresponding to the same row of X.

e.g. X = [1, x, x^2]:

 1.00000    18.78152   352.74566
 1.00000     0.61030     0.37246
 1.00000    21.41895   458.77124
 1.00000     3.83865    14.73521

Y =

  99.8043
   1.8283
 168.9060
 -29.0058

^ this is data for the function y = x^2 - 14x + 10
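For reference, an X and Y like the ones above can be built with something like:

x = [18.78152; 0.61030; 21.41895; 3.83865];    % sample input points
X = [ones(size(x)), x, x.^2];                  % columns: bias term, x, x^2
Y = x.^2 - 14*x + 10;                          % y = x^2 - 14x + 10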

Best Answer

You should perform feature normalization only on the features, i.e. only on your input $x$ (excluding the bias column of ones), not on the output $y$ and not on $\theta$. Once you have trained a model with feature normalization, you must apply that same normalization, using the means and standard deviations computed from the training set, every time you make a prediction. It is also expected that $\theta$ and the cost function $J(\theta)$ come out different with and without normalization, because the parameters now act on rescaled features. There is no need to ever undo feature scaling.
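To make this concrete, here is a rough Octave sketch that reuses the feature_scale function from the question. For brevity it fits $\theta$ with Octave's least-squares backslash operator instead of gradient descent; the normalization logic is the same either way, and the important part is that the training-set mu and sigma are saved and reapplied to every new input:

% Training data from the question (y = x^2 - 14x + 10).
x = [18.78152; 0.61030; 21.41895; 3.83865];
X = [ones(size(x)), x, x.^2];
Y = x.^2 - 14*x + 10;

% Normalize ONLY the feature columns; leave Y alone.
% mu and sigma are now part of the model and must be kept.
[Xscaled, mu, sigma] = feature_scale(X);

% Fit theta on the scaled features (least squares here for brevity;
% gradient descent on Xscaled converges to roughly the same theta).
theta = Xscaled \ Y;

% Predict for a new raw input, e.g. x = 5:
x_new        = [5, 5^2];                 % raw features, without the bias column
x_new_scaled = (x_new - mu) ./ sigma;    % apply the SAME training mu and sigma
y_pred       = [1, x_new_scaled] * theta % close to 5^2 - 14*5 + 10 = -35

Note that the bias column of ones goes back in after scaling, and that nothing is ever 'un-scaled': the model simply expects normalized inputs from now on.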