Feature Engineering – Proper Way to Scale Feature Data

feature-engineering, MATLAB, normalization

I am using the Iris dataset and trying to scale the features to the range [0,1]. After normalization, I want to binarize the features. The Iris dataset contains n = 150 examples, each with d = 4 features. One method is to normalize using the mean and standard deviation, applying the formula:

scaled_data = (data - mean of data) / (standard deviation of data)

Implementing it gives,

iris_data = load('iris.mtx');
y = load('iris.truth');
y = y + 1;
[n,d] = size(iris_data);
scaled_data = zeros(n,d);
for j = 1:d
    scaled_data(:,j) = (iris_data(:,j) - mean(iris_data(:,j))) / std(iris_data(:,j));
end

As an example, I generated random numbers from the uniform distribution in the range [0.1,8] as

a = 0.1;
b = 8;
r = (b-a).*rand(1000,1) + a;
scaled_data = (r - mean(r))/std(r);

But this does not scale the values to [0,1]! The values of scaled_data lie in [-1.75,1.7].
I am also unsure whether this is the correct way to do it: should I scale across the rows or across the columns, and how?

Problem 1: How do I properly scale the features to the [0,1] range? The related question "How to normalize data to 0-1 range?" explains the procedure:

data = (data - min(data))/(max(data)-min(data)); 

but what is this procedure called, and for a feature dataset, should I scale per example (row) or per feature variable (column)?

Problem 2: How do I then convert the scaled features to 0/1?

A few samples from the iris dataset are:

iris_data = 

5.10000000000000    3.50000000000000    1.40000000000000    0.200000000000000
4.90000000000000    3                   1.40000000000000    0.200000000000000
4.70000000000000    3.20000000000000    1.30000000000000    0.200000000000000
4.60000000000000    3.10000000000000    1.50000000000000    0.200000000000000
5                   3.60000000000000    1.40000000000000    0.200000000000000

Best Answer

First: data is usually normalized per feature, that is, over all instances of one feature (1 column of the 4 or 5 columns in iris) across all samples (150 rows in iris). So normalizing each column individually is correct in your example. To be specific, for normalization by the mean (usually referred to as the "shift"), one step would be: take e.g. the Sepal.Length feature and compute the mean over all 150 Sepal.Length values in your dataset. Then subtract exactly this value from all 150 Sepal.Length values.
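
As a minimal sketch of the per-feature idea, using only the five sample rows shown in the question rather than the full dataset (the names `sample`, `col_mean`, `centered`, and `standardized` are illustrative and not from the original post):

```matlab
% Five sample rows from the question: rows are examples, columns are features.
sample = [5.1 3.5 1.4 0.2;
          4.9 3.0 1.4 0.2;
          4.7 3.2 1.3 0.2;
          4.6 3.1 1.5 0.2;
          5.0 3.6 1.4 0.2];

% Work on one feature (column 1) across all examples (rows).
col = sample(:, 1);
col_mean = mean(col);        % one mean per feature
centered = col - col_mean;   % shift: subtract that mean from every value

% Dividing by the per-feature standard deviation completes the scaling.
standardized = centered / std(col);
```

The same two steps are repeated independently for every other column.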

Normalizing data by the mean and standard deviation (SD) transforms it so that its new mean is 0 and its SD is 1. The range is therefore intentionally not [0,1], which is correct and fine for many models. Further, scaling to a [0,1] range instead of normalizing by mean and SD has the downside that it is unstable, because it is strongly influenced by outliers: the minimum and maximum alone determine the shift and scale of your data, instead of the "majority" of the data, so a single extreme value can squeeze most of the transformed values into a small part of the interval.
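
To make the outlier effect on [0,1] scaling concrete, here is a small sketch with made-up values (the vector `x` is hypothetical and not taken from iris):

```matlab
% Hypothetical feature: nine ordinary values plus one outlier.
x = [1 2 3 4 5 6 7 8 9 100]';

% Min-max scaling: the outlier alone sets the upper end of the range.
x_minmax = (x - min(x)) / (max(x) - min(x));

% Largest scaled value among the nine ordinary points.
bulk_max = max(x_minmax(1:9));   % (9 - 1) / (100 - 1), about 0.08
```

Here nine of the ten values end up below 0.09, while the outlier occupies the rest of the interval.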

For your example in Matlab/Octave$^1$: you seem to mix up / (matrix right division) and ./ (element-wise division). The first example snippet normalizes iris_data to mean=0 and SD=1 (be aware that I only used iris columns 1 to 4, as column 5 is categorical):

> iris_data2 = (iris_data - mean(iris_data)) ./ std(iris_data);
> mean(iris_data2)

ans =

  -1.4572e-15  -1.6383e-15  -1.2923e-15  -5.5437e-16

> std(iris_data2)

ans =

   1.00000   1.00000   1.00000   1.00000
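
The one-liner above relies on automatic broadcasting (implicit expansion) of the row vectors returned by `mean` and `std` across the rows of the matrix. Octave supports this, and MATLAB does from R2016b onward. On older MATLAB releases the same computation can be sketched with `bsxfun`. The matrix `X` below is only a stand-in built from the sample rows in the question, not the full dataset:

```matlab
% First three features of the five sample rows from the question
% (the fourth is constant in these rows, so its SD would be zero).
X = [5.1 3.5 1.4;
     4.9 3.0 1.4;
     4.7 3.2 1.3;
     4.6 3.1 1.5;
     5.0 3.6 1.4];

% Subtract the per-column mean, then divide by the per-column SD.
X_centered = bsxfun(@minus, X, mean(X));
X_scaled   = bsxfun(@rdivide, X_centered, std(X));

disp(mean(X_scaled));   % each entry close to 0
disp(std(X_scaled));    % each entry close to 1
```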

The second example snippet does the same, but with a target range of [0,1] instead:

> iris_data3 = (iris_data - min(iris_data)) ./ (max(iris_data)-min(iris_data));
> max(iris_data3)

ans =

   1   1   1   1

> min(iris_data3)

ans =

   0   0   0   0
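
One caveat worth sketching: if a feature is constant, max minus min is zero and the division produces NaN. A possible guard is shown below, under the assumption that constant features should simply map to 0. That choice, and the names `X`, `col_min`, `span`, and `X01`, are illustrative only:

```matlab
% Five sample rows from the question; column 4 is constant in these rows.
X = [5.1 3.5 1.4 0.2;
     4.9 3.0 1.4 0.2;
     4.7 3.2 1.3 0.2;
     4.6 3.1 1.5 0.2;
     5.0 3.6 1.4 0.2];

col_min = min(X);                 % per-feature minimum (row vector)
span    = max(X) - col_min;       % per-feature range
span(span == 0) = 1;              % avoid 0/0 for constant features

% bsxfun keeps this working on MATLAB releases before R2016b as well.
X01 = bsxfun(@rdivide, bsxfun(@minus, X, col_min), span);
```

On the full iris data no feature is constant, so the guard changes nothing there.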

As for what to call them: I would usually refer to them as "normalization to mean=0 and SD=1" and "normalization to range [0,1]", as that usually makes clear what is meant and how it is applied to the data. They are also widely known as standardization (z-score normalization) and min-max scaling, respectively, so pick whichever terms your audience knows.

For binarization of features: this requires a threshold. Values at or below the threshold become 0 and values above it become 1. With a range of [0,1], a threshold of 0.5 could be reasonable (and can be adapted to better suit your needs):

> iris_data4 = iris_data3 > 0.5;
> iris_data4

iris_data4 =

   0   1   0   0
   0   0   0   0
   0   0   0   0
   0   0   0   0
   0   1   0   0
   [...]
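
If 0.5 is not a meaningful cut-off, one possible adaptation is a per-feature threshold such as the column median, so that roughly half of the values of each feature map to 1. This is only a sketch of one option, and `X01` here is a made-up stand-in for data already scaled to [0,1]:

```matlab
% Stand-in for data already scaled to [0,1] (made-up values).
X01 = [0.10 0.80;
       0.20 0.60;
       0.90 0.70;
       0.30 0.95];

thr   = median(X01);               % one threshold per feature (column)
X_bin = bsxfun(@gt, X01, thr);     % logical matrix: 1 where value > threshold
```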

$^1$ I tested this with Octave, so if there are problems in Matlab, please let me know.
