Solved – Dealing with data with high variance

optimal-scaling, regression

I have a scaling problem. Let's say my target variable is a net revenue column with a range of (-34624455, 298878399), so the max - min value is 333502854.

Now in the test set, I have a record whose revenue value is 2185, which, when normalized, converts to 0.1038.

For this record, the predicted value from a simple linear regression is 0.1037 (unlikely, but let's just assume). This converts back to -40209.0402, which is nowhere near the actual value of 2185. I understand that this is because of the huge range I've got, but how do I scale this sort of data? I tried removing the outliers, thinking it might reduce the effect of the range, but even in the subset without outliers the range is still huge, and I see the same effect: the predicted value in its normalized/scaled form is close to the normalized/scaled actual value, but once I convert it back to the original scale, it is not even close. What kind of scaling techniques should I use for this kind of data?

For now I used a simple scaling method, min-max scaling: (x - min) / (max - min).

Steps listed below:

2185 - (-34624455) = 34626640   # Subtracting the min value
34626640 / 333502854 = 0.103827117  # Dividing by the range

Assume the predicted value is 0.1037

0.1037 * 333502854 = 34584245.96  # Multiplying by the range
34584245.96 + (-34624455) = -40209.0402  # Adding the min value
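
Expressed as a minimal Python sketch (the scale/inverse_scale helper names are purely illustrative, not from any library), the same round trip is:

# Minimal sketch of the manual min-max round trip above
data_min = -34624455
data_max = 298878399
data_range = data_max - data_min  # 333502854

def scale(x):
    # Min-max scale x into [0, 1]
    return (x - data_min) / data_range

def inverse_scale(y):
    # Map a scaled value back to the original units
    return y * data_range + data_min

print(scale(2185))            # 0.103827117...
print(inverse_scale(0.1037))  # -40209.0402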

If I instead assume the predicted value to be 0.103827116, which matches the actual scaled value to 8 decimal places, then the inverse-scaled value is close to the actual.

Hope this makes the problem I am having a bit clearer. I am looking for pointers to more appropriate scaling methods, as the min-max and standardization techniques are clearly not working for this dataset.

Best Answer

I am not able to reproduce the problem you are describing. When testing your problem in Python with the following code:

from sklearn.preprocessing import MinMaxScaler

# The question's min, a test value, and the max, as one feature column
data = [[-34624455], [2185], [298878399]]
scaler = MinMaxScaler(feature_range=(0, 1))
print(scaler.fit(data))

data = scaler.transform(data)          # scale into [0, 1]
print("transform: \n", data)
data = scaler.inverse_transform(data)  # map back to the original units
print("inverse: \n", data)

I get the following output:

MinMaxScaler(copy=True, feature_range=(0, 1))
transform: 
 [[ 0.        ]
 [ 0.10382712]
 [ 1.        ]]
inverse: 
 [[ -3.46244550e+07]
 [  2.18500000e+03]
 [  2.98878399e+08]]

which seems to be exactly the behavior we want from the scaler.
However, when I tried the scaling myself on a pocket calculator with your rounded prediction of 0.1037, I also arrived at -40209, so I would attribute the discrepancy to the finite precision of the prediction rather than to the scaler itself: when you invert, any error in the scaled value is multiplied by the range (333502854), so a gap of about 0.000127 in scaled space becomes an error of about 42000 in the original units.
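
To make that amplification concrete, here is a quick check with the numbers from the question (plain Python, nothing library-specific):

# Quick check of the error amplification, using the question's numbers
data_range = 333502854           # max - min
actual_scaled = 0.103827117      # 2185 after min-max scaling
predicted_scaled = 0.1037        # prediction, rounded to 4 decimals

gap = actual_scaled - predicted_scaled  # ~0.000127 in scaled space
print(gap * data_range)                 # ~42394 in the original units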
How did you implement the scaling and what program did you use?

Another scaling technique, which you probably already know, is mean-scaling; it shouldn't have the same problem. It is difficult, however, to recommend scaling techniques without knowing what you need the scaled variables for. Scaling input variables is quite common, but there are not that many uses for scaled target variables (see the discussion here).
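
If by mean-scaling we mean mean normalization, here is a minimal sketch, assuming the definition (x - mean) / (max - min):

import numpy as np

# Minimal sketch of mean normalization, assuming it is defined as
# (x - mean) / (max - min); the values are the question's targets
x = np.array([-34624455.0, 2185.0, 298878399.0])

mean = x.mean()
rng = x.max() - x.min()

scaled = (x - mean) / rng       # centered near 0, spread at most 1
restored = scaled * rng + mean  # inverse transform

print(scaled)
print(restored)

Standardization, (x - mean) / std, is the other common variant and is available as sklearn's StandardScaler.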