Genetic algorithms were used to lower the prime-gap bound to 4,680 in the Polymath project that grew out of Zhang's recent twin-primes breakthrough. The bound has since been lowered by other methods, but this shows some potential for machine-learning approaches in this and related areas: they can be used to devise and optimize effective "combs" (essentially sieves) for screening the smallest possible prime gaps.
Together and Alone, Closing the Prime Gap (Erica Klarreich, Quanta magazine, 19 November 2013):
The team eventually came up with the Polymath project’s record-holder — a 632-tooth comb whose width is 4,680 — using a genetic algorithm that “mates” admissible combs with each other to produce new, potentially better combs.
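To give a sense of the mechanics, here is a toy sketch of my own (not the Polymath code, and at a vastly smaller scale than the 632-tooth combs above). A k-tuple of offsets is "admissible" if, for every prime p ≤ k, it leaves at least one residue class mod p uncovered; the genetic step "mates" two admissible combs by pooling their teeth and greedily repairing the result back to an admissible k-tuple, while selection keeps the narrowest combs. Function names and parameters are my own illustration, and real searches add mutation and far cleverer repair.

```python
import random

def primes_up_to(n):
    """Sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    return [p for p, is_prime in enumerate(sieve) if is_prime]

def is_admissible(offsets, k):
    """Admissible: for every prime p <= k, at least one residue class mod p
    is left uncovered (for p > k this holds automatically)."""
    return all(len({h % p for h in offsets}) < p for p in primes_up_to(k))

def random_comb(k, span):
    """Cheap admissible starting comb: k distinct primes in (k, span].
    Each such prime misses residue 0 mod every prime p <= k."""
    return sorted(random.sample([p for p in primes_up_to(span) if p > k], k))

def mate(a, b, k):
    """'Mate' two admissible combs: pool their teeth and greedily keep the
    first k offsets that preserve admissibility; return None if repair fails."""
    child = []
    for h in sorted(set(a) | set(b)):
        if is_admissible(child + [h], k):
            child.append(h)
        if len(child) == k:
            return child
    return None

def width(comb):
    return comb[-1] - comb[0]

# Toy evolution loop: repeatedly mate random pairs of admissible combs and
# retain the narrowest combs found so far.
k, span, pop_size, generations = 10, 200, 50, 500
population = [random_comb(k, span) for _ in range(pop_size)]
for _ in range(generations):
    child = mate(*random.sample(population, 2), k)
    if child is not None:
        population = sorted(population + [child], key=width)[:pop_size]

print("narrowest admissible comb found:", population[0], "width:", width(population[0]))
```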
What machine learning (ML) in practice looks like depends on your goal. In some situations, solid pre-processing followed by a suite of out-of-the-box ML methods might be good enough; even then, it is important to understand how the methods work so you can troubleshoot when things go wrong. But ML in practice can be much more than this, and MNIST is a good example of why.
It's deceptively easy to get 'good' performance on the MNIST dataset. For example, according to Yann LeCun's MNIST results page, K-nearest neighbours (K-NN) with the Euclidean (L2) distance metric also has an error rate of 3%, the same as your out-of-the-box random forest, and L2 K-NN is about as simple as an ML algorithm gets. On the other hand, Yann, Yoshua, Leon and Patrick's best first shot at this dataset, LeNet-4, has an error rate of 0.7%. Since 0.7% is less than a quarter of 3%, if you put the naive algorithm into practice reading handwritten digits, it would require roughly four times as much human effort to fix its errors.
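As a rough illustration of how cheap that baseline is to reproduce, here is a scikit-learn sketch. It uses a random train/test split rather than the official 60k/10k MNIST split, so the exact number will differ a little from the figure on LeCun's page.

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Download MNIST: 70,000 28x28 digit images as flat 784-dimensional vectors.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10_000, random_state=0
)

# K-NN with the default Minkowski metric at p=2, i.e. plain Euclidean (L2) distance.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

error_rate = 1.0 - knn.score(X_test, y_test)
print(f"L2 K-NN test error rate: {error_rate:.2%}")
```

(Fitting is instant, but predicting 10,000 test points against 60,000 training points is slow; subsample both if you only want a quick look.)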
The convolutional neural network that Yann and colleagues used is matched to the task, but I wouldn't call this 'feature engineering' so much as making an effort to understand the data and encoding that understanding into the learning algorithm.
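To make "encoding that understanding into the learning algorithm" concrete, here is a minimal LeNet-style model in PyTorch. This is a modern sketch in the spirit of LeNet-4/5, not the original architecture: the convolution and pooling layers bake in the assumptions that nearby pixels are related and that a pen stroke means the same thing wherever it appears in the image.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallDigitNet(nn.Module):
    """LeNet-style convolutional network for 1x28x28 digit images."""

    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5, padding=2)   # 28x28 -> 28x28
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)              # 14x14 -> 10x10
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)   # local features, downsample to 14x14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)   # higher-level features, down to 5x5
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                            # class logits

# Sanity check on a dummy batch of four images.
logits = SmallDigitNet()(torch.randn(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])
```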
So, what are the lessons?
- It is easy to reach the naive performance baseline using an out-of-the-box method and good preprocessing. You should always do this, so that you know where the baseline is and whether that performance level is good enough for your requirements. Beware, though: out-of-the-box ML methods are often 'brittle', i.e. surprisingly sensitive to the pre-processing. Once you've trained all the out-of-the-box methods, it's almost always a good idea to try bagging them (see the ensemble sketch after this list).
- Hard problems require domain-specific knowledge, a lot more data, or both. Feature engineering means using domain-specific knowledge to help the ML algorithm. However, if you have enough data, an algorithm (or approach) that can take advantage of that data to learn complex features, and an expert applying this algorithm, then you can sometimes forgo this knowledge (e.g. the Kaggle Merck challenge). Also, domain experts are sometimes wrong about what the good features are, so more data and ML expertise are always helpful.
- Consider error rate, not accuracy. An ML method with 99% accuracy makes half the errors that one with 98% accuracy does; sometimes this matters a great deal.
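On the "try bagging them" point above: reading that loosely as combining the models' predictions, here is a sketch using scikit-learn's soft-voting ensemble on the small built-in digits dataset (not MNIST), purely to show the mechanics; the model choices are arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)          # 8x8 digits, a quick stand-in for MNIST
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimators = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
]

# Report each out-of-the-box baseline's error rate, then the combined one.
for name, model in estimators:
    model.fit(X_train, y_train)
    print(f"{name:>6} error rate: {1.0 - model.score(X_test, y_test):.2%}")

ensemble = VotingClassifier(estimators=estimators, voting="soft").fit(X_train, y_train)
print(f"voting error rate: {1.0 - ensemble.score(X_test, y_test):.2%}")
```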
Best Answer
A couple things come to mind...
Performing convolutions efficiently as products in the Fourier domain. An example would be training large convolutional neural nets.
For example, see: Fast Training of Convolutional Networks through FFTs (Mathieu et al. 2013)
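The underlying identity is just the convolution theorem: convolution in the signal domain is pointwise multiplication in the Fourier domain, which turns an O(n·m) operation into O(n log n). A quick NumPy sanity check (1-D for simplicity; the CNN case is the same idea in 2-D, per channel):

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)
kernel = rng.standard_normal(31)

# Direct (time-domain) linear convolution.
direct = np.convolve(signal, kernel)

# Same thing via the FFT: zero-pad both to the full output length,
# multiply the transforms pointwise, and transform back.
n = len(signal) + len(kernel) - 1
fft_based = np.fft.irfft(np.fft.rfft(signal, n) * np.fft.rfft(kernel, n), n)

print(np.allclose(direct, fft_based))  # True (up to floating-point error)
```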
Another application is sparse signal processing, where the goal is to approximate a signal as a sparse linear combination of basis functions from a 'signal dictionary'. The link here is that sinusoids are, of course, a good dictionary for signals that are sparse in the Fourier domain. If I recall correctly, Fourier dictionaries show up in this literature.
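As a sketch of that idea (a toy example of my own, not taken from any particular paper): build an explicit dictionary of unit-norm sines and cosines, then ask orthogonal matching pursuit for a few atoms whose combination approximates a noisy signal.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n = 256
t = np.arange(n)

# Real 'Fourier dictionary': cosine and sine atoms at each integer frequency.
freqs = np.arange(1, n // 2)
atoms = [np.cos(2 * np.pi * f * t / n) for f in freqs] + \
        [np.sin(2 * np.pi * f * t / n) for f in freqs]
D = np.stack(atoms, axis=1)
D /= np.linalg.norm(D, axis=0)

# A signal that really is a sparse combination of three sinusoids, plus noise.
signal = (2.0 * np.sin(2 * np.pi * 7 * t / n)
          + 1.5 * np.cos(2 * np.pi * 31 * t / n)
          + 0.8 * np.sin(2 * np.pi * 50 * t / n)
          + 0.05 * rng.standard_normal(n))

# Sparse coding: pick the few dictionary atoms that best explain the signal.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(D, signal)
selected = np.flatnonzero(omp.coef_)
print("atoms selected:", selected, "reconstruction R^2:", omp.score(D, signal))
```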
On a related note, you should also be able to find Fourier methods in the compressed sensing literature.
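For instance, here is a toy sketch of the standard setup (again my own illustration, not a reference implementation, and the step counts and regularisation are untuned): measure a sparse signal through a random subset of its DFT coefficients and recover it by l1-regularised least squares, here via plain iterative soft-thresholding (ISTA).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, s = 256, 80, 8                      # signal length, measurements, nonzeros

# Ground truth: a signal that is sparse in the time domain.
support = rng.choice(n, s, replace=False)
x_true = np.zeros(n)
x_true[support] = rng.choice([-1.0, 1.0], s) * (1.0 + rng.random(s))

# Sensing operator: m randomly chosen rows of the normalized (unitary) DFT matrix.
rows = rng.choice(n, m, replace=False)
A = np.fft.fft(np.eye(n), axis=0)[rows] / np.sqrt(n)
y = A @ x_true

# ISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1 (step size 1 is safe: ||A||_2 = 1).
lam, x = 0.02, np.zeros(n, dtype=complex)
for _ in range(2000):
    z = x - A.conj().T @ (A @ x - y)
    mag = np.abs(z)
    x = z * np.maximum(1 - lam / np.maximum(mag, 1e-12), 0)   # complex soft-threshold

# Should be small if the sparse signal was recovered from the partial Fourier data.
print("relative error:", np.linalg.norm(x.real - x_true) / np.linalg.norm(x_true))
```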