If I understand your question, one of your ideas is to calculate a normalization (center and scale) across all of your data: both test and training.
Imagine that you take all of your data and calculate a centering and a scaling once, then use this on both the training set and test set. Your training set represents the data you have now, your test set represents future data that you don't have when you are training. But you somehow magically calculated centering and scaling values that included this future data. A Leak From The Future (tm), which is bad.
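To make the leak concrete, here is a minimal pure-Python sketch (with made-up numbers) contrasting the leaky global centering with the correct train-only centering:

```python
train = [1.0, 2.0, 3.0, 6.0]   # data you have at training time
test = [10.0, 14.0]            # "future" data

# WRONG: the centering constant peeks at the future test set
leaky_mean = sum(train + test) / len(train + test)   # 6.0

# RIGHT: compute the constant on the training set only...
train_mean = sum(train) / len(train)                 # 3.0
train_centered = [x - train_mean for x in train]
# ...and reuse that same constant on the test set
test_centered = [x - train_mean for x in test]       # [7.0, 11.0]
```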
When doing predictions with Random Forests, we very often (or always)
need to perform some pre-processing.
This is not true: Random Forest is genuinely an "off-the-shelf" method and requires essentially no pre-processing.
Outliers. Should we remove them all? If so, should we define an outlier
based on the 3/2 (1.5 × IQR) rule? Should we keep them? Why?
The base model used in RF is a large decision tree (usually built via CART). Decision trees are robust to outliers, because they isolate them in small regions of the feature space. Then, since the prediction for each leaf is the average (for regression) or the majority class (for classification), being isolated in separate leaves, outliers won't influence the rest of the predictions (in the case of regression for instance, they would not impact the mean of the other leaves). Bottom line: you don't care about outliers in RF. Just remove them if they are aberrant observations (e.g., due to recording errors). If they're valid cases, you can keep them.
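A toy illustration of this isolation effect, with made-up numbers: a single CART-style split puts the outlier alone in its own leaf, so the mean of the other leaf — the prediction for ordinary points — is untouched:

```python
x = [1.0, 1.2, 0.9, 1.1, 8.0]       # one extreme feature value
y = [10.0, 11.0, 9.0, 12.0, 500.0]  # and its aberrant response

threshold = 4.0  # the kind of split CART would favour: outlier isolated
left = [yi for xi, yi in zip(x, y) if xi <= threshold]
right = [yi for xi, yi in zip(x, y) if xi > threshold]

left_mean = sum(left) / len(left)     # prediction for normal points: 10.5
right_mean = sum(right) / len(right)  # the outlier, confined to its own leaf
```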
When dealing with deltas of observations (as an example, suppose I'm
subtracting a student grade from another), should I normalize the
delta of all students or just stick to the absolute delta? Sticking to
the same student case, if I have cumulative data (suppose for every
test I sum their last grades). Should the process be the same?
The question here is not really related to RF; it is algorithm-independent. The real question is: what do you want to do? What are you trying to predict?
Do we need to apply any data transformation like log or any other? If
so, when should it be done? When data range is large? What's the point
of changing the domain of the data here?
For the same reasons you don't need to worry about outliers, you don't need to apply any kind of data transformation when using RF: trees are invariant to monotonic transformations of the features, since splits depend only on the ordering of the values, not their scale. For classification, you may need to apply some kind of resampling/weighting strategy if you have a class-imbalance problem, but that's it.
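As an illustration of one common weighting scheme for imbalanced classes — the "balanced" heuristic, weight = n_samples / (n_classes × class_count), which is also what scikit-learn's `class_weight='balanced'` computes:

```python
from collections import Counter

labels = ["neg"] * 90 + ["pos"] * 10   # a 9:1 imbalanced toy dataset
counts = Counter(labels)
n, k = len(labels), len(counts)

# the rare class gets proportionally more weight
weights = {c: n / (k * cnt) for c, cnt in counts.items()}
# weights["neg"] = 100 / (2 * 90), weights["pos"] = 100 / (2 * 10)
```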
If I have a categorical target, can I apply regression instead of
classification so the output would be (suppose the classes are 0, 1,
2) 0.132, 0.431; so would it be more accurate?
You cannot meaningfully apply regression to a categorical target: the labels 0, 1, 2 are arbitrary codes, not quantities, so treating them as numbers imposes an ordering and spacing that does not exist in the data.
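If what you are after is a continuous output per class like 0.132, that is exactly what class probabilities give you: in Breiman's formulation they can be estimated as the fraction of trees voting for each class. A minimal sketch with hypothetical votes:

```python
# hypothetical predicted class from each of 10 trees for one observation
votes = [0, 1, 1, 2, 1, 0, 1, 1, 0, 1]

# probability estimate per class = share of trees voting for it
probs = {c: votes.count(c) / len(votes) for c in sorted(set(votes))}
# {0: 0.3, 1: 0.6, 2: 0.1}

predicted_class = max(probs, key=probs.get)  # the majority vote: class 1
```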
In what kind of problems is Random Forest more indicated? Large
datasets?
RF is suitable for all types of problems. People (especially in the medical field, genomics, etc.) even use it primarily for its variable-importance measures. In genetics, where researchers face the "small $n$, large $p$" problem, RF also does very well. Anyhow, machine learning in general requires sufficient amounts of training and testing data, though there is no general rule. If your training data represents all your concepts and if these concepts are easily capturable, a couple of hundred observations may suffice. However, if what should be learned is very complex and some concepts are underrepresented, more training data will be needed.
Should I discard the less important variables? Maybe they just create
noise?
Another nice feature of decision trees built through CART is that they automatically put aside the unimportant variables (only the best splitter is selected at each split). In the seminal book by Hastie et al. (2009), the authors showed that with 100 pure-noise predictors and 6 relevant predictors, the relevant variables were still selected about 50% of the time at each split. So you really don't need to worry about variable selection in RF. Of course, if you know that some variables are not contributing, don't include them; but if the underlying mechanisms of the process you're studying are mostly unknown, you can include all your candidate predictors.
Best Answer
You should apply the same preprocessing to all your data. However, if that preprocessing depends on the data (e.g., standardization, PCA), you should fit it on your training data only, and then use the parameters from that fit to transform your validation and test data.
For example, if you are centering your data (subtracting the mean), calculate the mean on your training data ONLY, then subtract that same mean from all your data (i.e., subtract the training-data mean from the validation and test data as well; DO NOT calculate three separate means).
For cross-validation, you'll have to recompute the preprocessing parameters in each iteration, using only the training folds, and then apply them to the held-out validation fold. If you then train a final model on all your data afterwards, recompute the preprocessing parameters using all of it.
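A bare-bones sketch of that per-fold recomputation, using centering on made-up numbers (any data-dependent step such as standardization or PCA would follow the same pattern):

```python
data = [2.0, 4.0, 6.0, 8.0]
folds = [[0, 1], [2, 3]]   # hypothetical index split into 2 CV folds

centered_folds = []
for val_idx in folds:
    # fit the preprocessing on the training folds only
    train_vals = [v for i, v in enumerate(data) if i not in val_idx]
    mu = sum(train_vals) / len(train_vals)
    # apply it to the held-out validation fold
    centered_folds.append([data[i] - mu for i in val_idx])

# final model on all the data: refit the preprocessing on all the data
final_mu = sum(data) / len(data)
```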