By sampling we make the algorithm think that the prior probabilities
of the classes are the same. This seems to affect the predictions as
well, and therefore the predicted probabilities can no longer be
interpreted as probabilities and have to be recalibrated.
You seem mostly correct; the phenomenon is called prior probability shift. It is the case where $P_{train}(x|y) = P_{test}(x|y)$ but $P_{train}(y) \neq P_{test}(y)$. I'm not sure what you mean by 'probabilities not remaining the same and having to be recalibrated'.
Let's assume we have a two-class classification problem with an imbalanced
dataset that we oversample to get the same class distribution. We run
decision trees on it. The test set is imbalanced, but does it really
matter?
Yes, it matters and that is the cause of the problem.
Each sample of the test set just goes through the nodes of the decision tree, which never checks whether the sample belongs to the majority or minority class.
Correct. The problem is not with how a decision tree predicts a given sample point. The issue lies with the way it was trained and with the characterization of the feature space for each class.
Oversampling the minority class is one way to deal with the class-imbalance problem, but it is not ideal. When the minority class is oversampled by increasing amounts, the effect is to identify similar but more specific regions in the feature space as the decision region for the minority class. The decision tree would predict a given point in the way that you mentioned, but if its decision regions are not accurate because of the way it was trained, then it won't predict well.
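To make this concrete, here is a minimal sketch of random oversampling followed by fitting a decision tree. It assumes a scikit-learn/numpy workflow; the data, sizes, and class labels are illustrative only, not taken from your setup:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy imbalanced data: 900 majority (class 0) vs 100 minority (class 1) points.
X = np.vstack([rng.normal(0, 1, (900, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 900 + [1] * 100)

# Oversample the minority class with replacement until the classes match.
minority_idx = np.where(y == 1)[0]
extra = rng.choice(minority_idx, size=900 - 100, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

# The tree now sees equal priors; its decision regions for class 1 are carved
# around the duplicated minority points, which is the effect described above.
tree = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)
print(tree.predict_proba(X[:5]))
```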
So, why does the prior probability of the classes affect the prediction for a sample?
Prior probability shift is a particular type of dataset shift. There is a fair amount of work on this topic in the literature, and both generative and discriminative models suffer from the problem. The general idea is that whether you are training a discriminative model, $P(y|x) = \frac{P(x|y)P(y)}{P(x)}$, or a generative model, $P(x,y) = P(x|y)P(y)$, a change in $P(y)$ affects both $P(y|x)$ and $P(x,y)$. If it is $P(x)$ that changes between the train and test datasets, the phenomenon is instead called covariate shift. You can learn more about dataset shift here; it was probably one of the first compilations of the work done on dataset shift.
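Here is a small numeric sketch of that Bayes-rule dependence, plus the kind of recalibration your quote alludes to. The `adjust` function uses the standard prior-correction formula (as in Elkan 2001 / Saerens et al. 2002); that choice of correction is my addition, not something from the linked book:

```python
def posterior(likelihood_pos, likelihood_neg, prior_pos):
    """P(y=1|x) from the class-conditional likelihoods and the prior P(y=1)."""
    num = likelihood_pos * prior_pos
    den = num + likelihood_neg * (1.0 - prior_pos)
    return num / den

lik_pos, lik_neg = 0.30, 0.10               # same P(x|y) in train and test
print(posterior(lik_pos, lik_neg, 0.5))     # balanced training prior  -> 0.75
print(posterior(lik_pos, lik_neg, 0.1))     # imbalanced test prior    -> 0.25

def adjust(p, pi_train, pi_test):
    """Map a score trained under prior pi_train back to the test prior pi_test."""
    num = p * pi_test / pi_train
    den = num + (1 - p) * (1 - pi_test) / (1 - pi_train)
    return num / den

print(adjust(0.75, pi_train=0.5, pi_test=0.1))   # recovers ~0.25
```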
On a side note, you can refer to this paper on SMOTE. It addresses the oversampling issue with decision trees and provides a better way to rebalance the dataset by creating synthetic points for the minority class. It is widely used, and I believe various implementations of this method already exist.
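One such implementation lives in the imbalanced-learn (imblearn) package; a short sketch, with an illustrative synthetic dataset rather than your data:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy 90/10 imbalanced dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))

# SMOTE creates synthetic minority points instead of duplicating existing ones.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))   # classes are now balanced
```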
1) Does random oversampling the minority class increase the size of
the final data set such that each class is the same size as that of
the majority?
It does; however, it doesn't have to. Depending on the implementation, one could of course also oversample to a size that is bigger than the original majority class if the misclassification costs associated with the problem warrant that.
The same goes for undersampling: the general idea is to balance the data set, so getting the classes to matching sizes makes sense unless we have a reason to tip the class balance the other way.
3) Are these methods random sampling with or without replacement?
In the case of the first graphic you posted, and assuming our objective is to even out the size of the classes, we will have to sample with replacement, unless we synthesize new instances from the known ones, like SMOTE does.
In the case of undersampling, both with and without replacement are possible, although I'm only aware of it being used without replacement.
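A tiny sketch of that with/without-replacement distinction in plain numpy (the index ranges and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
majority_idx = np.arange(0, 900)      # 900 majority samples
minority_idx = np.arange(900, 1000)   # 100 minority samples

# Random oversampling of the minority class must use replacement:
# we need 900 draws from only 100 distinct points.
over_idx = rng.choice(minority_idx, size=900, replace=True)

# Random undersampling of the majority class is usually done without replacement.
under_idx = rng.choice(majority_idx, size=100, replace=False)
```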
Best Answer
Removing samples from the majority class may cause the classifier to miss important concepts/features pertaining to the majority class.
One strategy, called informed undersampling, has demonstrated good results. An unsupervised learning algorithm is used to draw independent random samples from the majority class, and multiple classifiers are then built on the combination of each majority-class subset with the minority-class data.
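A hedged sketch of that "multiple classifiers on majority subsets" idea (an EasyEnsemble-style scheme; the function names, base learner, and subset count below are illustrative, not the paper's exact procedure):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def informed_undersampling_ensemble(X, y, n_subsets=5, random_state=0):
    """Train one classifier per independent majority subset combined with all minority data."""
    rng = np.random.default_rng(random_state)
    maj_idx = np.where(y == 0)[0]
    min_idx = np.where(y == 1)[0]
    models = []
    for _ in range(n_subsets):
        # Independent random subset of the majority class, same size as the minority class.
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([sub, min_idx])
        models.append(DecisionTreeClassifier(random_state=random_state).fit(X[idx], y[idx]))
    return models

def ensemble_proba(models, X):
    # Average the minority-class probabilities over the subset classifiers.
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```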
Another example of informed undersampling uses the K-nearest neighbor (KNN) classifier to achieve undersampling. Of the four KNN-based methods, NearMiss-3 looks the most straightforward: it selects a given number of the closest majority samples for each minority sample, which guarantees that every minority sample is surrounded by some majority samples. However, another method, NearMiss-2, in which majority-class samples are selected if their average distance to the three farthest minority-class samples is the smallest, has proved to be the most competitive approach in imbalanced learning.
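If you want to try these, imbalanced-learn ships NearMiss variants; a short sketch, again on an illustrative synthetic dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# version=2 selects NearMiss-2, version=3 selects NearMiss-3.
X_nm2, y_nm2 = NearMiss(version=2).fit_resample(X, y)
X_nm3, y_nm3 = NearMiss(version=3).fit_resample(X, y)
print(Counter(y_nm2), Counter(y_nm3))
```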
The profit (cost) matrix can be considered a numerical representation of the penalty for classifying samples of one class as another. In decision trees,
(1) cost-sensitive adjustments can be applied to the decision threshold;
An ROC curve is used to plot the range of performance values as the decision threshold is moved from the point where the total misclassifications of the majority class are maximally costly to the point where the total misclassifications of the minority class are maximally costly. The most dominant point on the ROC curve corresponds to the final decision threshold (see the sketch after this list). Read this paper for more details.
(2) cost-sensitive considerations can be given to the split criteria at each node;
This is achieved by fitting an impurity function, and the split with the maximum fitting accuracy at each node is selected. This tutorial generalizes the effects of decision tree growth for any choice of split criterion.
(3) cost-sensitive pruning schemes can be applied to the tree.
Pruning improves generalization by removing leaves with class-probability estimates below a specified threshold. The Laplace smoothing method applied to pruning is described in the same tutorial here; it reduces the chance that pruning removes leaves belonging to the minority class.
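As promised above, here is a hedged sketch of point (1), moving the decision threshold along the ROC curve to minimise expected misclassification cost. It uses scikit-learn's `roc_curve`; the cost values, model, and data are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)

# Assumed costs: false negatives on the minority class cost 10x false positives.
c_fn, c_fp = 10.0, 1.0
pi = y_te.mean()                                  # minority prevalence
cost = c_fn * (1 - tpr) * pi + c_fp * fpr * (1 - pi)

# The threshold with minimum expected cost replaces the default 0.5 cut-off.
best = thresholds[np.argmin(cost)]
y_pred = (scores >= best).astype(int)
```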