I am going to define the GBM algorithm first to clarify the question:
Algorithm 10.3: Gradient Tree Boosting Algorithm from Hastie's The Elements of Statistical Learning states the following:
Here $N$ is the number of samples, $M$ is the number of iterations, and $J_m$ is the number of terminal regions (the size of the $m$-th tree). Line 3 produces $\hat{f}(x)$, a $K$-dimensional vector where $K$ is the number of classes.
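To make the structure of the algorithm concrete (this is a minimal sketch, not the book's pseudocode): the loop below assumes a generic `loss` object exposing `init_raw_prediction` and `negative_gradient` (illustrative names), uses scikit-learn's `DecisionTreeRegressor` as the base learner, and replaces the per-leaf line search of step 2(c) with the tree's own leaf means.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boosting(X, y, loss, M=100, max_leaf_nodes=8, learning_rate=0.1):
    """Sketch of gradient tree boosting: fit M regression trees to pseudo-residuals."""
    f = np.full(len(y), loss.init_raw_prediction(y))    # step 1: constant initial fit
    trees = []
    for m in range(M):
        r = loss.negative_gradient(y, f)                # step 2(a): pseudo-residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, r)                                  # step 2(b): fit a tree to the residuals
        f += learning_rate * tree.predict(X)            # steps 2(c)-(d): update raw scores
        trees.append(tree)
    return trees, f                                     # line 3: f is the final raw prediction
```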
How is the residual transformed before being fit by a tree? Log of the residual?
The residual is defined in step (a) above and depends directly on the loss function. As in other problems, the loss function depends on the distribution chosen to model the conditional probability of $y \mid x$ and corresponds to the negative log-likelihood of that distribution. For a Poisson distribution the log-likelihood is:
$$\ell(y;\lambda) = \sum_k \big(y_k\log(\lambda_k) - \lambda_k - \log(y_k!)\big)$$
Or the loss:
$$L(y;\lambda) = \sum_k \big(\lambda_k + \log(y_k!) - y_k\log(\lambda_k)\big)$$
Since we want to minimize the loss and will eventually take a derivative, we can drop the constant term $\log(y_k!)$, and the above simplifies to:
$$L(y;\lambda) = \sum_k \lambda_k - y_k\log(\lambda_k) $$
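As a quick numerical illustration (not part of the original answer): assuming the usual log link $\lambda_i = e^{f(x_i)}$, the simplified loss can be evaluated directly from the raw scores; the function name below is illustrative.

```python
import numpy as np

def poisson_loss(y, raw_score):
    """Simplified Poisson loss sum_i(lambda_i - y_i * log(lambda_i)),
    with the log link lambda_i = exp(raw_score_i)."""
    lam = np.exp(raw_score)
    return np.sum(lam - y * np.log(lam))   # equivalently sum(exp(f) - y * f)

y = np.array([0.0, 2.0, 5.0])
f = np.log(np.array([0.5, 2.0, 4.0]))      # raw scores whose exp is the predicted rate
print(poisson_loss(y, f))
```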
In the algorithm outlined above, the residual is the negative of the partial derivative of the loss w.r.t. $f(x_i)$, evaluated at $f_{m-1}$. It can be interpreted as the residual left over after our previous update of $f$.
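For example, assuming the log-link parameterization $\lambda_i = e^{f(x_i)}$ (the algorithm itself only sees the raw scores $f$), the pseudo-residual from step (a) works out to

$$r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f=f_{m-1}} = y_i - e^{f_{m-1}(x_i)},$$

i.e. the observed count minus the currently predicted rate, so no separate transform (such as a log) is applied to the residual before the tree is fit to it.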
Do we need to apply a transform at the end to get the prediction? Example: prediction = exp(average of Y + prediction of the first tree + prediction of the second tree + …)
To produce probabilities the following transformation is used:
$$p_k(x) = \frac{\exp(f_k(x))}{\sum_{l=1}^K \exp(f_l(x))}$$
where again, $k$ is the class of interest. To make a prediction we simply take the $k$ that produces the maximum $p$.
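A small sketch of that final step, assuming the fitted raw scores have been collected into an array of shape `(n_samples, K)`; the function name is illustrative.

```python
import numpy as np

def softmax_probabilities(raw_scores):
    """Convert per-class raw scores f_k(x) into probabilities p_k(x)."""
    z = raw_scores - raw_scores.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

raw_scores = np.array([[1.2, -0.3, 0.1],
                       [0.0,  2.5, 0.4]])    # two samples, K = 3 classes
proba = softmax_probabilities(raw_scores)
pred = proba.argmax(axis=1)                  # take the class k with the largest p_k(x)
```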
Gradient descent is not used for training decision trees. Not every machine learning algorithm uses a general-purpose optimization algorithm (e.g. gradient descent) for training; some use specialized procedures instead. Examples are $k$-NN, naive Bayes, and decision trees: we do not train them by directly minimizing some loss function to find the best set of parameters (in fact, $k$-NN and decision trees do not have parameters per se), but rely on their own algorithms for finding the solution.
As a side note, gradient descent is not even always used for algorithms that do train by directly minimizing a loss function. It is only one of many optimization algorithms, and it is not even the most efficient one, as many algorithms work better for particular problems. Gradient descent became popular mostly because it is easy and efficient to use for training neural networks, but that does not make it a "one-size-fits-all" algorithm.
Best Answer
In fact, there is not much difference between regression and classification. The only difference is the loss function: in regression the model minimizes, e.g., the RMSE, while in classification it minimizes the logistic loss.
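As a rough illustration (not part of the original answer), this is how the same boosting machinery is switched between the two tasks in scikit-learn; the exact loss strings depend on the scikit-learn version (older releases used `'ls'` and `'deviance'`):

```python
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

# Regression: boosting minimizes a squared-error-type loss
reg = GradientBoostingRegressor(loss="squared_error", n_estimators=100)

# Classification: the same algorithm minimizes the logistic (log) loss
clf = GradientBoostingClassifier(loss="log_loss", n_estimators=100)
```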
Details can be found here: Regularization methods for logistic regression