One basic technique is Naive Bayes, which is often used for spam filtering. Essentially, you look at how often each word appears in sentences you have already judged to be spam, how often it appears in sentences you have judged to be not spam, and then use those frequencies to make a judgement about new sentences.
You first estimate the conditional probabilities:
- feature on (1) or off (0) given spam, $\newcommand{\bm}[1]{\mathbf{#1}}P(\bm{f}_i=1|S),\; P(\bm{f}_i=0|S)$ for all features $\bm{f}_i$, and
- feature on or off given not spam, $P(\bm{f}_i=1|\neg S), P(\bm{f}_i=0|\neg S)$.
These can be estimated from the training set by summing each feature column over the spam training examples and dividing by the number of spam examples, and likewise for the non-spam examples. You also need to estimate the incidence of spam: the prior probabilities for spam $P(S)$ and not spam $P(\neg S)$.
Given a new example with a feature vector $\bm{f}$, you can use Bayes rule and the naive assumption of independence of feature probabilities to write
$$P(S|\bm{f})=\frac{P(\bm{f}|S)P(S)}{P(\bm{f})} = \frac{P(\bm{f}_1|S)P(\bm{f}_2|S)\ldots P(\bm{f}_n|S)\;P(S)}{P(\bm{f})}=\frac{\left(\prod P(\bm{f}_i|S)\right)\;P(S)}{P(\bm{f})}$$
Similarly, $P(\neg S|\bm{f}) = \left(\prod P(\bm{f}_i|\neg S)\right)P(\neg S)\,/\,P(\bm{f})$. So you can get the posteriors $P(S|\bm{f})$ and $P(\neg S|\bm{f})$ from the probabilities estimated earlier. You decide whether the new example is spam or not spam based on which posterior is higher; since the denominator $P(\bm{f})$ is the same in both expressions, you can simply drop it.
This looks easy to implement in RapidMiner, from a quick web search. It's also generally very easy to implement from scratch: low tens of lines of code in a language like R or Matlab. If you do so, note that when an estimated conditional probability is zero you need to replace it with some small non-zero value (e.g. via Laplace smoothing), otherwise the whole product of conditional probabilities collapses to zero. It's also worth doing the multiplications as sums of logs to avoid numerical underflow.
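To make that concrete, here is a minimal from-scratch sketch in Python/NumPy rather than the R or Matlab mentioned above; the tiny feature matrix at the end is made up purely for illustration. It estimates the conditional probabilities with Laplace smoothing (so nothing is exactly zero) and compares the two posteriors in log space:

```python
import numpy as np

def train_nb(X, y, alpha=1.0):
    """Estimate log priors and log conditional probabilities with Laplace smoothing.

    X: binary feature matrix (n_examples x n_features); y: 1 for spam, 0 for not spam.
    """
    model = {}
    for c in (0, 1):
        Xc = X[y == c]
        # P(f_i = 1 | class c), smoothed so no estimate is exactly zero
        p = (Xc.sum(axis=0) + alpha) / (Xc.shape[0] + 2 * alpha)
        model[c] = {
            "log_prior": np.log(Xc.shape[0] / X.shape[0]),  # P(S) or P(not S)
            "log_p1": np.log(p),                            # log P(f_i = 1 | c)
            "log_p0": np.log(1.0 - p),                      # log P(f_i = 0 | c)
        }
    return model

def predict_nb(model, x):
    """Return 1 (spam) if the unnormalised log posterior for spam is larger."""
    scores = {}
    for c, m in model.items():
        # sum of logs replaces the product of conditional probabilities
        scores[c] = m["log_prior"] + np.sum(x * m["log_p1"] + (1 - x) * m["log_p0"])
    return max(scores, key=scores.get)

# toy usage: 4 training messages, 3 binary word features (made-up data)
X = np.array([[1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
model = train_nb(X, y)
print(predict_nb(model, np.array([1, 0, 0])))  # prints 1, i.e. spam, on this toy data
```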
Naive Bayes is a very simple technique, but it has some drawbacks. One is how to deal with conditional probabilities that are zero, as discussed above. Perhaps more seriously, Naive Bayes as described here pays no attention to ordering or context (words occurring together can have quite different meanings than the same words occurring singly). A good way to find information on more sophisticated techniques would be to research sentiment analysis.
Machine learning (ML) in practice depends on what the goal of doing ML is. In some situations, solid pre-processing and applying a suite of out-of-the-box ML methods might be good enough. Even then, it is important to understand how the methods work so that you can troubleshoot when things go wrong. But ML in practice can be much more than this, and MNIST is a good example of why.
It's deceptively easy to get 'good' performance on the MNIST dataset. For example, according to Yann Le Cun's website on MNIST performance, K-nearest neighbours (K-NN) with the Euclidean (L2) distance metric also has an error rate of 3%, the same as your out-of-the-box random forest, and L2 K-NN is about as simple as an ML algorithm gets. On the other hand, Yann, Yoshua, Leon & Patrick's best first shot at this dataset, LeNet-4, has an error rate of 0.7%. Since 0.7% is less than a fourth of 3%, if you put the naive system into practice reading handwritten digits it will require roughly four times as much human effort to fix its errors.
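As a rough illustration of how little code such a baseline takes, here is a sketch assuming scikit-learn is available; it uses sklearn's small built-in 8x8 digits set rather than the full 28x28 MNIST, so the error rate will not match the 3% quoted above:

```python
# L2 K-NN baseline sketch (assumes scikit-learn; small 8x8 digits set, not full MNIST)
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# K-NN with the Euclidean (L2) metric: about as simple as a classifier gets
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)
print("error rate: %.3f" % (1 - knn.score(X_test, y_test)))
```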
The convolutional neural network that Yann and colleagues used is matched to the task, but I wouldn't call this 'feature engineering' so much as making an effort to understand the data and encoding that understanding into the learning algorithm.
So, what are the lessons:
- It is easy to reach the naive performance baseline using an out-of-the-box method and good pre-processing. You should always do this, so that you know where the baseline is and whether that level of performance is good enough for your requirements. Beware, though: out-of-the-box ML methods are often 'brittle', i.e. surprisingly sensitive to the pre-processing. Once you've trained all the out-of-the-box methods, it's almost always a good idea to try bagging them (see the sketch after this list).
- Hard problems require domain-specific knowledge, a lot more data, or both. Feature engineering means using domain-specific knowledge to help the ML algorithm. However, if you have enough data, an algorithm (or approach) that can take advantage of that data to learn complex features, and an expert applying this algorithm, then you can sometimes forego this knowledge (e.g. the Kaggle Merck challenge). Also, domain experts are sometimes wrong about what good features are, so more data and ML expertise is always helpful.
- Consider error rate, not accuracy. An ML method with 99% accuracy makes half the errors of one with 98% accuracy; sometimes this is important.
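Here is the sketch referred to in the first lesson: a 'baseline first, then combine' workflow assuming scikit-learn, with synthetic placeholder data standing in for your own pre-processed features. It combines the out-of-the-box models with a majority-vote `VotingClassifier` (voting across different models, rather than bagging in the strict bootstrap-resampling sense):

```python
# Baseline-then-combine sketch (assumes scikit-learn; the synthetic data below is a
# placeholder for your own pre-processed feature matrix and labels).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
    ("nb", BernoulliNB()),
]

# Each out-of-the-box model on its own gives you a baseline error rate ...
for name, model in models:
    model.fit(X_train, y_train)
    print(name, "error rate: %.3f" % (1 - model.score(X_test, y_test)))

# ... and a hard-voting ensemble of the same models is a cheap way to combine them.
ensemble = VotingClassifier(estimators=models, voting="hard")
ensemble.fit(X_train, y_train)
print("ensemble error rate: %.3f" % (1 - ensemble.score(X_test, y_test)))
```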
Best Answer
I would say experience -- basic ideas are: