Solved – Why is feature transformation needed in machine learning & statistics? Doesn’t it affect the “interaction” between features?

data-transformation, feature-engineering, machine-learning, mathematical-statistics

Before feeding data into machine learning models, we can apply data transformations and feature scaling, depending on the data distribution.

For example, if a column is skewed, we can use a Box-Cox transformation to reduce its skewness:
https://machinelearningmastery.com/power-transform-time-series-forecast-data-python/
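Here is a minimal sketch of that idea using `scipy.stats.boxcox` on a synthetic, right-skewed column; the data is made up purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # synthetic, strongly right-skewed feature

# Box-Cox requires strictly positive values; it returns the transformed
# column and the fitted lambda parameter.
x_bc, lam = stats.boxcox(x)

print("skewness before:", stats.skew(x))
print("skewness after: ", stats.skew(x_bc))
print("fitted lambda:  ", lam)
```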

But do these feature transformations affect the interaction between features, and thus hurt the predictive power of the ML models?

After I performed a Box-Cox transformation on several columns in a dataset, I found that the correlation coefficients also changed. This gave me the "impression" that the data had been altered way too much. But on the other hand, as long as I apply the same transformation to the test data, a model learned on the transformed training data should also work on the test data.
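To illustrate both points, here is a hedged sketch (the data is synthetic and the number of columns arbitrary) that fits the Box-Cox parameters on the training split only, reuses them on the test split, and shows that the Pearson correlations do change after the transformation:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Three positive, skewed features; the first two share a latent factor,
# so they are correlated.
z = rng.normal(size=500)
x1 = np.exp(z + 0.5 * rng.normal(size=500))
x2 = np.exp(z + 0.5 * rng.normal(size=500))
x3 = rng.lognormal(size=500)
X = np.column_stack([x1, x2, x3])

X_train, X_test = train_test_split(X, random_state=0)

pt = PowerTransformer(method="box-cox")   # learns one lambda per column
X_train_t = pt.fit_transform(X_train)     # fit on the training data only
X_test_t = pt.transform(X_test)           # the same lambdas applied to the test data

print("train correlations before:\n", np.corrcoef(X_train, rowvar=False).round(2))
print("train correlations after:\n", np.corrcoef(X_train_t, rowvar=False).round(2))
```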

Standardization does not affect the correlations between features.
http://rstudio-pubs-static.s3.amazonaws.com/318113_6581029a53064b988b700fc3eee55864.html#
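This is easy to check empirically (a sketch with synthetic data, not a proof): standardization rescales each column linearly, and Pearson correlation is invariant to per-column linear rescaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Mix independent columns to obtain correlated features.
X = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 1.0]])

X_std = StandardScaler().fit_transform(X)

# The correlation matrices match before and after standardization.
print(np.allclose(np.corrcoef(X, rowvar=False),
                  np.corrcoef(X_std, rowvar=False)))   # True
```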

Best Answer

Your question asks about feature transformation in machine learning & statistics. This is actually two separate questions, in line with the two main uses of statistical techniques: To Explain, and To Predict.

Feature transformation in explanatory statistics

You're using a relatively simple statistical technique, like linear regression or PCA, to separate out and explain the various effects you're seeing in the data. Any transformation you perform on the data is something you need to keep track of and include in your final equation.

An example: Imagine you have independent variables $x_1,x_2,x_3$ and dependent variable $y$. Your first step will probably be to do a linear regression to find an equation $y = ax_1 + bx_2 + cx_3 +d$. You do this and find that the $R^2$ is really poor. You plot the raw data and find that there is a lot of heteroscedasticity to it, and that the data points are strongly skewed, with some points occurring orders of magnitude higher than others. This observation leads you to try doing the regression on the logs of the variables $x_1,x_2,x_3,y$:

$\log(y) = a\log(x_1) + b\log(x_2) + c\log(x_3) + d.$

This approach gives you a much higher $R^2$. But now you've actually changed the equation. If you exponentiate everything, you'll find this is equivalent to $y = x_1^a x_2^b x_3^c e^d$. The variables are now multiplied together rather than summed. You've found a different and better equation to fit the data with. By transforming the data, you change the model, and therefore change the explanation.
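As a concrete sketch of this (the true exponents and the noise level below are invented for illustration), ordinary least squares on the logged variables recovers the multiplicative model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.lognormal(size=(n, 3))                        # x1, x2, x3 > 0
a, b, c, d = 0.7, -0.3, 1.2, 0.5                      # "true" parameters (made up)
y = (X[:, 0]**a * X[:, 1]**b * X[:, 2]**c * np.exp(d)
     * np.exp(0.1 * rng.normal(size=n)))              # multiplicative noise

# Regressing log(y) on log(x1), log(x2), log(x3) turns the multiplicative
# model back into a linear one.
model = LinearRegression().fit(np.log(X), np.log(y))
print("estimated a, b, c:", model.coef_.round(2))     # roughly [0.7, -0.3, 1.2]
print(f"estimated d: {model.intercept_:.2f}")         # roughly 0.5
```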

Feature transformation in predictive statistics

You're using some form of machine learning process to spit out answers from a set of inputs. Your model is a black box, and no one cares what's inside it as long as it spits out accurate answers quickly. Feature transformation can be considered part of this black box: the raw information is getting chopped up and compressed anyway, so it doesn't matter in the slightest that it goes through a transformation before being fed into the machine.

Feature transformation is important for the training of machine learning tools, usually for mundane, practical reasons. For example, with neural network training (especially if you're using sigmoid or tanh activation functions), you want all your input variables to be transformed so that they are somewhere close to zero (usually between -2 and +2). There is no fundamental statistical reason for this, it's just to make the training process go faster.
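In practice this is often handled by bundling the scaling step with the model, so the scaling learned from the training data is automatically reused at prediction time. A sketch with scikit-learn (the dataset, column scales, and network size are arbitrary choices for the example):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
X = X * np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])  # give the columns wildly different scales

# StandardScaler pushes each input roughly into the -2..+2 range that the
# tanh/sigmoid units are comfortable with; its parameters are learned only
# from the data passed to fit().
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32,), activation="tanh",
                 max_iter=2000, random_state=0),
)
model.fit(X, y)
print("training R^2:", round(model.score(X, y), 3))
```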

Neural networks, like many complicated statistical techniques, use some form of gradient descent to optimise their parameters and actually 'learn'. Gradient descent requires the local topography of the parameter space to be sloped so that it can follow the direction of the slope. Inputting a very large positive or negative number into a sigmoid function results in a number that is very close to either 0 or 1, where the local topography of the parameter space is almost flat. This will cause the algorithm either to learn extremely slowly, or decide it's already at the optimum and declare that it's done learning. Both of these are bad.
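You can see the saturation directly: the sigmoid's derivative, which scales the gradient step, collapses to essentially zero for large-magnitude inputs. A small numeric illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 10.0, 50.0]:
    s = sigmoid(z)
    grad = s * (1.0 - s)   # derivative of the sigmoid at z
    print(f"z = {z:5.1f}   sigmoid = {s:.6f}   derivative = {grad:.2e}")
```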

Feature transformations in machine learning training are used to massage the data into a format that is more conducive to rapid learning.