I like to think of this in analogy with the case of linear models, and their extension to GLMs (generalized linear models).
In a linear model, we fit a linear function to predict our response
$$ \hat y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n $$
To generalize to other situations, we introduce a link function, which transforms the linear part of the model onto the scale of the response (technically this is an inverse link, but I think it's easier to think of it this way, transforming the linear predictor into a response, than transforming the response into a linear predictor).
For example, the logistic model uses the sigmoid (inverse logit) function
$$ \hat y = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n))} $$
and Poisson regression uses the exponential function
$$ \hat y = \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n) $$
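To make the link-function picture concrete, here is a minimal numpy sketch (the coefficients and data are made up purely for illustration) showing that all three models share the same linear predictor and differ only in the inverse link applied to it:

```python
import numpy as np

# Hypothetical coefficients and feature matrix, just for illustration.
beta0 = 0.5
beta = np.array([1.2, -0.7])
X = np.array([[0.3, 1.5],
              [2.0, 0.1]])

eta = beta0 + X @ beta                    # the linear predictor, shared by all three models

y_hat_linear = eta                        # identity link: ordinary linear regression
y_hat_logistic = 1 / (1 + np.exp(-eta))   # sigmoid (inverse logit): logistic regression
y_hat_poisson = np.exp(eta)               # exponential (inverse log link): Poisson regression
```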
To construct an analogy with gradient boosting, we replace the linear part of these models with the sum of the boosted trees. So, for example, the Gaussian case (analogous to linear regression) becomes the well-known
$$ \hat y = \sum_i h_i $$
where $h_i$ is our sequence of weak learners. The binomial case is analogous to logistic regression (as you noted in your answer)
$$ \hat y = \frac{1}{1 + \exp\left(-\sum_i h_i\right)} $$
and Poisson boosting is analogous to Poisson regression
$$ \hat y = \exp\left(\sum_i h_i\right) $$
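In code, the only change from the GLM version is that the linear predictor is replaced by the summed output of the weak learners, with the inverse link applied on top. A sketch, assuming `trees` is a list of already-fitted regression trees with a scikit-learn-style `predict` method and `learning_rate` is whatever shrinkage was used during fitting:

```python
import numpy as np

def boosted_predict(trees, X, link="identity", learning_rate=0.1):
    """Sum the weak learners, then map through the inverse link."""
    raw = sum(learning_rate * tree.predict(X) for tree in trees)  # the "linear part"
    if link == "logit":
        return 1 / (1 + np.exp(-raw))   # binomial boosting
    if link == "log":
        return np.exp(raw)              # Poisson boosting
    return raw                          # Gaussian boosting
```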
The question remains: how does one fit these boosted models when the link function is involved? For the Gaussian case, where the link is the identity function, the often-heard mantra of fitting weak learners to the residuals of the current working model works out, but this doesn't really generalize to the more complicated models. The trick is to write the loss function being minimized as a function of the linear part of the model (i.e. the $\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$ part of the GLM formulation).
For example, the binomial loss (the negative log-likelihood) is usually encountered as
$$ -\sum_i \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right] $$
Here, the loss is a function of $p_i$, the predicted values on the same scale as the response, and $p_i$ is a non-linear transformation of the linear predictor $L_i$. Instead, we can re-express this as a function of $L_i$ (in this case also known as the log-odds)
$$ -\sum_i \left[ y_i L_i - \log(1 + \exp(L_i)) \right] $$
Then we can take the gradient of this with respect to $L$, and boost to directly minimize this quantity.
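Working that gradient out is what makes the scheme practical: since $\frac{\partial}{\partial L_i}\log(1 + \exp(L_i)) = \frac{1}{1 + \exp(-L_i)} = p_i$, the negative gradient of the loss above with respect to $L_i$ is simply

$$ y_i - p_i $$

so each boosting stage fits a weak learner to $y_i - p_i$: a residual on the probability scale, accumulated on the log-odds scale.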
Only at the very end, when we want to produce predictions for the user, do we apply the link function to the final sequence of weak learners to put the predictions on the same scale as the response. While fitting the model, we internally work on the linear scale the entire time.
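Here is a stripped-down sketch of what that looks like for the binomial case (hypothetical helper names; it fits each tree to the negative gradient $y_i - p_i$ and omits the per-leaf line search that production implementations add on top):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_binomial_boost(X, y, n_stages=100, learning_rate=0.1, max_depth=2):
    """Minimal binomial gradient boosting, done entirely on the log-odds scale."""
    p0 = np.clip(y.mean(), 1e-6, 1 - 1e-6)
    f0 = np.log(p0 / (1 - p0))             # start from the log-odds of the base rate
    L = np.full(len(y), f0)                # current log-odds for every training point
    trees = []
    for _ in range(n_stages):
        p = 1 / (1 + np.exp(-L))           # current probabilities (inverse link)
        residual = y - p                   # negative gradient of the loss w.r.t. L
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        L += learning_rate * tree.predict(X)   # the update stays on the log-odds scale
        trees.append(tree)
    return f0, trees

def predict_proba(f0, trees, X, learning_rate=0.1):
    """Only here, at prediction time, do we apply the inverse link."""
    L = f0 + learning_rate * sum(tree.predict(X) for tree in trees)
    return 1 / (1 + np.exp(-L))
```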
It is reasonably widely recognised that feature engineering improves the outcome when using relatively advanced algorithms such as GBMs or Random Forests. There are a few reasons, relating both to overall accuracy and to usability.
Firstly, if you actually want to use the model, features will require maintenance and implementation and will often require explanation to users. That is, each extra feature will create extra work. So for practical purposes, it's useful to eliminate features that don't contribute materially to improved accuracy.
With respect to overall accuracy, additional features and/or poorly engineered features increase the likelihood that you're training your model on noise rather than signal. Hence using domain knowledge or inspection of the data to suggest alternative ways to engineer features will usually improve results. The Kaggle blog (blog.kaggle.com) includes 'how they did it' write-ups from podium finishers in each competition. These usually include descriptions of feature engineering - arguably more frequently than descriptions of model tuning, emphasising the importance of feature engineering - and some of them are very creative, including leveraging domain knowledge provided by competition organisers or otherwise discovered during the competition.
This recent write-up is a good example of domain knowledge acquired during the competition being used to select/engineer features: https://medium.com/kaggle-blog/2017-data-science-bowl-predicting-lung-cancer-2nd-place-solution-write-up-daniel-hammack-and-79dc345d4541 (the sections headed 'Pre-processing' and 'External Data' give good examples).
Best Answer
I've answered question 2a on this site before. The answer to 2b, as you suspect, is the same: in general, gradient boosting, when used for classification, fits trees not to the gradient of the predicted probabilities, but to the gradient of the predicted log-odds. Because of this, 2b reduces to 2a in principle.

As for 1: the power of gradient boosting is that it allows us to build predictive functions of great complexity. The issue with building predictive functions of great complexity is the bias-variance tradeoff. Large complexity means very low bias, which unfortunately is wed to very high variance.
If you fit a complex model in one go (a deep decision tree, for example), you have done nothing to deal with this variance explosion, and you will find that your test error is very poor.
Boosting is essentially a principled way of carefully controlling the variance of a model while building a complex predictive function. The main idea is that we should build the predictive function very slowly, and constantly check our work to see if we should stop building. This is why using a small learning rate and weak individual learners is so important to using boosting effectively. These choices allow us to layer on complexity very slowly and to take a lot of care in constructing our predictive function. They also give us many places to stop, by monitoring the test error at each stage in the construction.
If you do not do this, your boosted model will be poor, often as poor as a single decision tree. Try setting the learning rate to $1.0$ in a gradient boosted model, or using very deep trees as individual learners.
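If you want to see this for yourself, something along these lines (a sketch using scikit-learn's `GradientBoostingClassifier`; the dataset and parameter choices are placeholders) will typically show the slowly-built model reaching a much better held-out error than the `learning_rate=1.0` one, with `staged_predict_proba` providing exactly the 'check your work at every stage' monitoring described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for lr in (1.0, 0.1):
    model = GradientBoostingClassifier(learning_rate=lr, n_estimators=500,
                                       max_depth=2, random_state=0).fit(X_tr, y_tr)
    # Monitor held-out error at every boosting stage.
    curve = [log_loss(y_te, p) for p in model.staged_predict_proba(X_te)]
    best = min(curve)
    print(f"learning_rate={lr}: best test log-loss {best:.4f} at stage {curve.index(best) + 1}")
```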