Is it generally advised to remove redundant information (if highly
correlated) when training neural networks?
It depends.
Is it necessary? No.
Since a neural network with an appropriate architecture can model any (!) function, you can safely assume that it could also first model the PCA and then do whatever else it should do -- e.g. classification, regression, etc. (source)
and
In principle, the linear transformation performed by PCA can be performed just as well by the input layer weights of the neural network, so it isn't strictly speaking necessary (source)
This is because neural networks can themselves be used as a non-linear dimensionality reduction tool:
High-dimensional data can be converted to low-dimensional codes by
training a multilayer neural network with a small central layer to
reconstruct high-dimensional input vectors (source)
In this context, it is also worth mentioning auto-encoders.
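For concreteness, here is a minimal auto-encoder sketch (PyTorch assumed; the layer sizes, bottleneck width and training loop are placeholder choices, not a recommendation):

```python
# A minimal auto-encoder sketch (PyTorch assumed; sizes are placeholders).
# The small central ("bottleneck") layer forces the network to learn a
# low-dimensional code from which the high-dimensional input can be rebuilt.
import torch
import torch.nn as nn

n_features = 50   # dimensionality of the raw input (placeholder)
code_size = 5     # width of the central layer, i.e. the learned code

encoder = nn.Sequential(nn.Linear(n_features, 25), nn.ReLU(), nn.Linear(25, code_size))
decoder = nn.Sequential(nn.Linear(code_size, 25), nn.ReLU(), nn.Linear(25, n_features))
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
X = torch.randn(1024, n_features)          # stand-in for your real data

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)            # reconstruct the input itself
    loss.backward()
    optimizer.step()

codes = encoder(X).detach()                # low-dimensional codes for downstream use
```

After training, the encoder half plays the role of a learned, non-linear dimensionality reduction that can be reused in front of a classifier or regressor.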
Can it help? Yes, it can speed things up.
However, as the number of weights in the network increases, the
amount of data needed to be able to reliably determine the weights of
the network also increases (often quite rapidly), and over-fitting
becomes more of an issue (using regularisation is also a good idea).
The benefit of dimensionality reduction is that it reduces the size of
the network, and hence the amount of data needed to train it (source)
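To illustrate, one common way to get that benefit is to put PCA in front of the network as a preprocessing step. This is only a sketch (scikit-learn assumed; the data, the number of components and the network size are made-up placeholders):

```python
# Hypothetical illustration (scikit-learn assumed): PCA as a preprocessing step.
# Reducing 100 raw inputs to 10 principal components shrinks the first-layer
# weight matrix of the network tenfold, so fewer samples are needed to fit it.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 10))                  # latent low-dimensional structure
X = Z @ rng.normal(size=(10, 100)) + 0.1 * rng.normal(size=(500, 100))  # 100 correlated features
y = (Z[:, 0] > 0).astype(int)                   # made-up target driven by the latent data

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),                       # 100 -> 10 inputs to the network
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=0),
)
model.fit(X, y)
print(model.score(X, y))                        # training accuracy, just to show it runs
```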
Speed comes at a cost and bears a risk
The disadvantage of using PCA is that the discriminative information
that distinguishes one class from another might be in the low variance
components, so using PCA can make performance worse (source)
This may well be what you experienced in your experiment.
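Here is a small synthetic illustration of that failure mode (numpy/scikit-learn assumed; the data are made up): the two classes differ only along a low-variance direction, so keeping just the first principal component throws the class signal away.

```python
# Hypothetical illustration: the class signal lives in a LOW-variance direction,
# so keeping only the first principal component discards exactly that signal.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)
x_high_var = rng.normal(scale=10.0, size=n)     # large variance, no class information
x_low_var = y + rng.normal(scale=0.2, size=n)   # small variance, all the class information
X = np.column_stack([x_high_var, x_low_var])

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y).mean())        # ~0.99: both features kept

X_pca = PCA(n_components=1).fit_transform(X)    # keeps (roughly) the high-variance axis
print(cross_val_score(clf, X_pca, y).mean())    # ~0.5: the discriminative info is gone
```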
I think Arthur's answer is quite valuable here, but some more detail can be added.
Let's keep in mind that feature engineering is not the same as data collection. You should absolutely collect more data and more variables/information if you can. However, at the stage of feature engineering, we're assuming you already have all the raw data you could get.
During feature engineering, your job is to select, transform and combine these raw data variables in ways that are helpful to the model.
The process that Arthur outlines is absolutely correct: to engineer good features, you need some subject-matter input (someone who understands exactly what each variable means, how it was gathered, etc.). This way, you can combine that expertise with some creativity to find new ways of representing the data.
Let's move away from these somewhat abstract definitions, and let me give you some potential ideas for features.
The most fundamental thing to keep in mind: there is a huge range of potential here. As long as you're not leaking data, you can try literally whatever you can think of. There are no boundaries. If you have an idea that makes sense in the context of the problem, try it!
Some ideas I had, given your dataset (adjust these to your understanding of the data, as I might not fully know your situation):
- Average booking QTY of the customer (up until a point in time, remember - no data leaking from the future, don't average across the whole dataset, only the data you would've had at that point; see the sketch after this list for one way to compute such an expanding average without leakage)
- Average size of order or price of item of the customer (up until a point in time, remember - no data leaking from the future, don't average across the whole dataset, only the data you would've had at that point. As I hope you realize, this comment applies to ALL the features I mention, and all features you can think of.)
- Number of orders this customer has made with you
- With your date_diff variables, you could again take averages (like average time to respond to this customer etc., rather than just the most recent response time)
- You could take standard deviations of these date_diff variables (to see how consistent it is over time)
- If you have the data for it, you could look at the average of actual revenue realized vs. your "revenue expected" column ("On average, the realized revenue is x% of expected revenue for this customer/category.")
- You could look at any of the above variables and calculate them only for the customers within a certain category, and then compare your given customer to all other customers in that category (so for example, look at the number of orders customer X, in category Z, has made with you COMPARED to the average number of orders ALL customers in category Z have made with you. You could express this number as a percentile).
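To make the "only the data you would've had at that point" rule concrete, here is a minimal pandas sketch of a leakage-free expanding average per customer (the column names customer_id, order_date and booking_qty are hypothetical stand-ins for whatever your dataset actually uses):

```python
# Hypothetical sketch: per-customer expanding average of booking quantity.
# shift(1) excludes the current order, so each row only "sees" strictly
# earlier orders -- no information leaks from the future.
# Column names (customer_id, order_date, booking_qty) are placeholders.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2023-01-05", "2023-02-10", "2023-03-01", "2023-01-20", "2023-02-15"]),
    "booking_qty": [10, 20, 30, 5, 15],
})

df = df.sort_values(["customer_id", "order_date"])
df["avg_qty_so_far"] = (
    df.groupby("customer_id")["booking_qty"]
      .transform(lambda s: s.shift(1).expanding().mean())
)
print(df)
```

The same shift-then-expand pattern works for order size, price, the date_diff averages and standard deviations, and the realized-vs-expected revenue ratio.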
I think this gives some pointers and inspiration!
Best Answer
I don't think this can be answered in the abstract; we would need to know what the goal is and what the variables are. Some possibilities:
Two variables which are basically different measurements of the same thing, with independent errors: then keeping both (or using their mean) looks reasonable. Other cases are less clear-cut.
Since you have a classification problem, the following is relevant: Feature Selection: Correlation and Redundancy. Even highly correlated variables can carry non-redundant information, and in the example given there, removing either of them would destroy the information content. But, interestingly, in that example replacing the two variables with their difference would work! That underscores that each case has to be treated on its own; a general threshold value that would always work would be difficult to find.
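A small made-up illustration of that kind of example (numpy/scikit-learn assumed): two features that are almost perfectly correlated, yet neither can be dropped, because the class label lives in their difference.

```python
# Hypothetical illustration: x1 and x2 are almost perfectly correlated,
# yet dropping either one destroys the class information, while the single
# derived feature (x1 - x2) separates the classes on its own.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)
common = rng.normal(scale=10.0, size=n)     # shared component -> high correlation
x1 = common + 0.5 * y
x2 = common - 0.5 * y
print(np.corrcoef(x1, x2)[0, 1])            # ~0.999

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, np.column_stack([x1, x2]), y).mean())  # ~1.0 with both
print(cross_val_score(clf, x1.reshape(-1, 1), y).mean())          # ~0.5 with x1 alone
print(cross_val_score(clf, (x1 - x2).reshape(-1, 1), y).mean())   # ~1.0 with the difference
```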