It sounds as if deep learning might be very interesting for you. This is a fairly recent field of deep connectionist models which are pretrained in an unsupervised way and then fine-tuned with supervision. The fine-tuning requires far fewer samples than the pretraining.
To whet your appetite, I recommend "Semantic Hashing" by Salakhutdinov and Hinton. Have a look at the codes it finds for distinct documents of the Reuters corpus (unsupervised!).
If you need some code to start from, check out deeplearning.net. I don't believe there are out-of-the-box solutions, though.
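To make the pretrain-then-fine-tune recipe concrete, here is a minimal NumPy sketch under invented assumptions (toy Gaussian data, a linear autoencoder, and an encoder frozen during fine-tuning). It is not the model from the paper, just the two-phase idea: learn a representation from plenty of unlabeled data, then attach a classifier using only a few labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data (not a real corpus): two Gaussian clusters in 10-D.
# Pretraining sees 400 unlabeled points; fine-tuning sees only 10 labels.
X_unlab = np.vstack([rng.normal(-1, 1, (200, 10)), rng.normal(1, 1, (200, 10))])
X_lab = np.vstack([rng.normal(-1, 1, (5, 10)), rng.normal(1, 1, (5, 10))])
y_lab = np.array([0] * 5 + [1] * 5)

# --- Phase 1, unsupervised pretraining: linear autoencoder 10 -> 2 -> 10 ---
W1 = rng.normal(0, 0.1, (10, 2))           # encoder
W2 = rng.normal(0, 0.1, (2, 10))           # decoder
for _ in range(300):
    H = X_unlab @ W1                       # 2-D codes
    R = (H @ W2 - X_unlab) / len(X_unlab)  # scaled reconstruction residual
    g1 = 2 * X_unlab.T @ (R @ W2.T)        # gradients of mean squared error
    g2 = 2 * H.T @ R
    W1 -= 0.01 * g1
    W2 -= 0.01 * g2

# --- Phase 2, supervised fine-tuning: logistic head on the 2-D codes ---
# (For brevity only the new head is trained here; full fine-tuning
#  would also update the encoder W1.)
w, b = np.zeros(2), 0.0
H_lab = X_lab @ W1
for _ in range(500):
    p = 1 / (1 + np.exp(-(H_lab @ w + b)))
    w -= 0.1 * H_lab.T @ (p - y_lab) / len(y_lab)
    b -= 0.1 * np.mean(p - y_lab)

pred = (X_lab @ W1 @ w + b > 0).astype(int)  # predictions for the labeled points
```

The point of the sketch is the split: the representation comes from the plentiful unlabeled data, and the small labeled set only has to fit a tiny classifier on top of it.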
What machine learning (ML) looks like in practice depends on the goal of doing it. In some situations, solid pre-processing and a suite of out-of-the-box ML methods might be good enough. Even then, it is important to understand how the methods work so that you can troubleshoot when things go wrong. But ML in practice can be much more than this, and MNIST is a good example of why.
It's deceptively easy to get 'good' performance on the MNIST dataset. For example, according to Yann LeCun's website on MNIST performance, K nearest neighbours (K-NN) with the Euclidean (L2) distance metric also has an error rate of 3%, the same as your out-of-the-box random forest, and L2 K-NN is about as simple as an ML algorithm gets. On the other hand, Yann, Yoshua, Leon & Patrick's best first shot at this dataset, LeNet-4, has an error rate of 0.7%. Since 0.7% is less than a quarter of 3%, if you put the naive algorithm into practice reading handwritten digits, it would require more than four times as much human effort to fix its errors.
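For a sense of how simple that baseline is, here is L2 K-NN in a few lines of NumPy (the points below are made up for illustration; on MNIST you would feed in the 784-pixel vectors):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """K nearest neighbours with the Euclidean (L2) distance metric."""
    # Squared L2 distance from every test point to every training point.
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k nearest training points, then a majority vote.
    nearest = np.argsort(d2, axis=1)[:, :k]
    return np.array([np.bincount(y_train[row]).argmax() for row in nearest])

# Tiny illustration: two obvious clusters around (0, 0) and (5, 5).
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([[0.5, 0.5], [5.5, 5.5]])))
# -> [0 1]
```

There is no training step at all; every prediction is just a distance computation and a vote, which is why it makes such a useful sanity-check baseline.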
The convolutional neural network that Yann and colleagues used is matched to the task, but I wouldn't call this 'feature engineering' so much as making an effort to understand the data and encode that understanding into the learning algorithm.
So, what are the lessons?
- It is easy to reach the naive performance baseline using an out-of-the-box method and good preprocessing. You should always do this, so that you know where the baseline is and whether or not that performance level is good enough for your requirements. Beware, though: out-of-the-box ML methods are often 'brittle', i.e., surprisingly sensitive to the pre-processing. Once you've trained all the out-of-the-box methods, it's almost always a good idea to try bagging them.
- Hard problems require domain-specific knowledge, a lot more data, or both. Feature engineering means using domain-specific knowledge to help the ML algorithm. However, if you have enough data, an algorithm (or approach) that can take advantage of that data to learn complex features, and an expert applying this algorithm, then you can sometimes forgo this knowledge (e.g. the Kaggle Merck challenge). Also, sometimes domain experts are wrong about what good features are; so more data and ML expertise are always helpful.
- Consider error rate, not accuracy. An ML method with 99% accuracy makes half the errors of one with 98% accuracy; sometimes this is important.
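As a sketch of the bagging suggestion above: bagging trains a base learner on bootstrap resamples of the training set and majority-votes the resulting predictions. The nearest-centroid base learner and the toy data here are invented for illustration; in practice the base learners would be your actual out-of-the-box models.

```python
import numpy as np

def fit_centroids(X, y):
    """Base learner: a deliberately weak nearest-centroid classifier."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_centroids(model, X):
    classes, centroids = model
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[d2.argmin(axis=1)]

def bagged_predict(X_train, y_train, X_test, n_models=25, seed=0):
    """Bagging: fit each model on a bootstrap resample, then majority-vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # sample n rows with replacement
        model = fit_centroids(X_train[idx], y_train[idx])
        votes.append(predict_centroids(model, X_test))
    votes = np.stack(votes)                # shape (n_models, n_test)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Toy usage: two tight clusters at (0, 0) and (5, 5).
Xt = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
yt = np.array([0] * 10 + [1] * 10)
print(bagged_predict(Xt, yt, np.array([[0.1, 0.1], [4.9, 5.1]])))  # -> [0 1]
```

The averaging over resamples is what tames the brittleness: a single unlucky fit is outvoted by the rest of the ensemble.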
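Spelling out the arithmetic in that last bullet:

```python
def error_rate(accuracy):
    """Error rate is simply the complement of accuracy."""
    return 1.0 - accuracy

# A one-point gain in accuracy (98% -> 99%) halves the errors:
ratio = error_rate(0.99) / error_rate(0.98)
print(round(ratio, 3))  # -> 0.5
```

Reported as accuracy, the two systems look nearly identical; reported as error rate, one clearly needs twice the human clean-up effort.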
All unsupervised algorithms, e.g.
Some of them might internally use regression or classification elements, but the algorithm itself is neither.
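For instance (my own example, since the list above is cut off), k-means clustering uses no labels anywhere, even though it internally re-estimates cluster means, a regression-like element:

```python
import numpy as np

def kmeans(X, k=2, n_iter=20, seed=0):
    """Plain k-means. Note that no labels appear anywhere: unsupervised."""
    rng = np.random.default_rng(seed)
    # Initialise centroids at k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid ...
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        # ... then re-estimate each centroid as a mean: an internal
        # regression-like step, yet the algorithm as a whole is neither
        # regression nor classification.
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return assign, centroids
```

The output is a grouping of the inputs, not a prediction of any target variable, which is exactly what makes it unsupervised.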