I have been working in machine learning and noticed that dimensionality reduction techniques like PCA and t-SNE are used most of the time in machine learning, but I rarely see anyone doing this for deep learning projects. Is there a specific reason for not using dimensionality reduction techniques in deep learning?
Neural Networks – Are Dimensionality Reduction Techniques Useful in Deep Learning?
dimensionality-reduction, neural-networks, pca, tsne
Related Solutions
Before I attempt to answer your question, I want to draw a sharper distinction between the methods you are referring to.
The first set of methods are neighborhood-based dimensionality reduction methods, in which a neighborhood graph is constructed whose edges represent a distance metric. Now, to play devil's advocate against myself, MDS and ISOMAP can both be interpreted as forms of kernel PCA, so although this distinction seems relatively sharp, various interpretations shift these methods from one class to another.
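To make the neighborhood-based family concrete, here is a minimal sketch using scikit-learn's Isomap (my own illustration; the swiss-roll data and the choice of 10 neighbors are arbitrary assumptions for the example):

```python
# Neighborhood-based embedding: a k-nearest-neighbor graph is built over the
# samples, and shortest-path (geodesic) distances on that graph drive the
# low-dimensional layout.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)   # (1000, 3)

embedding = Isomap(n_neighbors=10, n_components=2)
X_2d = embedding.fit_transform(X)                        # (1000, 2)
print(X_2d.shape)
```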
The second set of methods I would place in the field of unsupervised neural-network learning. Autoencoders are a special architecture that maps the input space into a lower-dimensional space from which the input can be decoded back with minimal loss of information.
First, let's talk about the benefits and drawbacks of autoencoders. Autoencoders are generally trained using some variant of stochastic gradient descent, which yields some advantages: the dataset does not have to fit into memory, since it can be loaded dynamically and trained on in minibatches, unlike many neighborhood-based methods, which require the whole dataset to reside in memory. The architecture of an autoencoder also allows prior knowledge about the data to be incorporated into the model. For example, if our dataset contains images, we can create an architecture that uses 2D convolutions; if it contains time series with long-term dependencies, we can use gated recurrent networks (check out Seq2Seq learning). This is the power of neural networks in general: they allow us to encode prior knowledge about the problem into our models, something that other models, and more specifically classical dimensionality reduction algorithms, cannot do.
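As an illustration of both points (minibatch training and encoding image structure through convolutions), here is a minimal convolutional-autoencoder sketch in PyTorch; the layer sizes and the 28×28 single-channel input are assumptions made purely for the example:

```python
# Minimal convolutional autoencoder sketch (illustrative only).
# The 2D-convolution layers encode the prior that inputs are images, and
# minibatch training means the full dataset never has to sit in memory.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),  # 28x28 -> 14x14
            nn.ReLU(),
            nn.Conv2d(16, 4, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(4, 16, kernel_size=2, stride=2),     # 7x7 -> 14x14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=2, stride=2),     # 14x14 -> 28x28
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in minibatch of single-channel 28x28 images; in practice this would
# come from a DataLoader that streams batches from disk.
batch = torch.rand(32, 1, 28, 28)
optimizer.zero_grad()
recon = model(batch)
loss = loss_fn(recon, batch)   # reconstruction loss
loss.backward()
optimizer.step()
```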
From a theoretical perspective, there are a couple of nice theorems: as the depth of a network increases, the complexity of the functions it can learn grows exponentially. In general, at least until something new is discovered, you are not going to find a more expressive/powerful model than a correctly chosen neural network.
Now, although all of this sounds great, there are drawbacks. Convergence of neural networks is non-deterministic and depends heavily on the architecture used, the complexity of the problem, the choice of hyperparameters, etc. The expressiveness of neural networks also causes problems: they tend to overfit very quickly if the right regularization is not chosen and used.
On the other hand, neighborhood methods are less expressive and tend to run in a deterministic amount of time until convergence, governed by far fewer parameters than neural networks.
The choice of method depends directly on the problem. If you have a small dataset that fits in memory and does not involve any type of structured data (images, video, audio), classical dimensionality reduction is probably the way to go. But as structure is introduced, the complexity of your problem increases, and the amount of data you have grows, neural networks become the better choice.
Hope this helps.
You have many options. You could check the correlation between features and remove features that are highly correlated with other ones. You could build a random forest on the data, observe the resulting feature importances, and remove the features with low importance. You could do something similar with logistic regression to select a subset of the features as well. Here is a nice discussion of that: Importance of variables in logistic regression.
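As a rough sketch of the first two suggestions (dropping one feature from each highly correlated pair, then ranking the rest with a random forest), here is an example using scikit-learn; the 0.95 correlation threshold and the breast-cancer dataset are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# 1. Remove features highly correlated with an earlier feature.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)

# 2. Rank the remaining features by random-forest importance.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_reduced, y)
importances = pd.Series(rf.feature_importances_, index=X_reduced.columns)
print(importances.sort_values(ascending=False).head(10))
```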
Best Answer
$t$-SNE
Two obvious reasons that $t$-SNE is not commonly used as a dimensionality reduction method are that it is non-deterministic and that it can't be applied in a consistent fashion to test-set data. See: Are there cases where PCA is more suitable than t-SNE?
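Both points are easy to see with scikit-learn's TSNE, which I use here purely as an illustration: different random seeds give different embeddings, and the estimator exposes fit_transform but no transform method for mapping held-out points into an existing embedding.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

emb_a = TSNE(n_components=2, random_state=0).fit_transform(X)
emb_b = TSNE(n_components=2, random_state=1).fit_transform(X)
print(np.allclose(emb_a, emb_b))      # False: the layout depends on the seed

print(hasattr(TSNE(), "transform"))   # False: no out-of-sample mapping
```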
PCA
First, PCA is not inherently a dimensionality reduction method. It's a method that produces a new matrix of the same size, represented in a decorrelated basis. Truncated PCA reduces the rank of that matrix, so it is reduced in dimension.
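A quick sketch of that distinction with scikit-learn (my example, using the digits data simply because it is handy): a full PCA is only a rotation into a decorrelated basis of the same size, while choosing n_components smaller than the original dimension is what actually reduces it.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # shape (1797, 64)

full = PCA().fit(X)
X_rotated = full.transform(X)              # still (1797, 64), decorrelated basis

truncated = PCA(n_components=16).fit(X)
X_reduced = truncated.transform(X)         # (1797, 16): rank/dimension reduced
print(X_rotated.shape, X_reduced.shape)
```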
Second, even if you do not use PCA to reduce dimensionality, it can still be useful. In "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", Sergey Ioffe and Christian Szegedy suggest that whitening transformations are helpful during optimization.
Clearly, PCA yields decorrelated vectors, and subtracting the mean and rescaling by the standard deviation achieves the rest. This suggests that pre-whitening the input data might give your model a nice boost in terms of training time.
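A hypothetical pre-whitening step along those lines, sketched with scikit-learn's PCA(whiten=True); fitting on the training split only and keeping 32 components are assumptions of the example (keeping only the leading components avoids dividing by near-zero variances):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
X_train, X_test = X[:1500], X[1500:]

# whiten=True rotates into a decorrelated basis and rescales each component
# to unit variance; the mean is subtracted internally.
whitener = PCA(n_components=32, whiten=True).fit(X_train)
X_train_white = whitener.transform(X_train)   # inputs fed to the network
X_test_white = whitener.transform(X_test)     # same fixed transform at test time
```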
Whether or not whitening is helpful for any particular model is, obviously, problem-specific. One very common deep learning application is computer vision. These networks tend not to use whitening transformations because the transformation to an orthogonal basis changes the image in a way that might not actually be useful to whatever network you're using. I'm not aware of an example where PCA improves a modern deep neural network for image classification, but that's probably due to a limitation of my knowledge; I'm sure someone will post a recent convolutional-neural-network paper that uses PCA in a comment.
Moreover, truncated PCA of an image will, obviously, distort the image in some way, with the amount of distortion depending on the number of PCs that you retain.
On the other hand, a great reason to use truncated PCA for dimensionality reduction is when your data is rank-deficient. It's common for hand-crafted feature vectors, such as those used in a feed-forward network, to have a certain amount of redundancy. Presenting all of these features to your network unnecessarily increases the number of parameters, so it can be more efficient to drop them.
Common Sense
If we take a wider view of dimensionality reduction, we can still reduce the dimension of our data by using common sense.
Consider the MNIST task. The digits occupy the center of the image. If you look at the whole data set, you can find that there are some pixels around the periphery of each image which are always white. If you trim each image to exclude these always-white pixels, you've taken a significant step towards reducing how much computational power you need, since all of these pixels are now effectively "skipped over". "Always white" pixels have no useful information for the network because the pixel values are constant in all samples, so you're not losing any distinguishing information.
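A small sketch of that trim, assuming MNIST-style arrays of shape (n_samples, 28, 28); the function name and the toy data are my own invention for illustration:

```python
import numpy as np

def crop_constant_border(images):
    # Pixels whose value never changes across the dataset carry no
    # distinguishing information, so crop to the bounding box of the
    # pixels that actually vary.
    varies = images.std(axis=0) > 0                  # (28, 28) boolean mask
    rows = np.where(varies.any(axis=1))[0]
    cols = np.where(varies.any(axis=0))[0]
    return images[:, rows.min():rows.max() + 1, cols.min():cols.max() + 1]

# Toy stand-in for MNIST: varying digits padded with a constant border.
toy = np.zeros((100, 28, 28))
toy[:, 4:24, 4:24] = np.random.rand(100, 20, 20)
print(crop_constant_border(toy).shape)               # (100, 20, 20)
```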