I like to think of this in analogy with the case of linear models, and their extension to GLMs (generalized linear models).
In a linear model, we fit a linear function to predict our response
$$ \hat y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n $$
To generalize to other situations, we introduce a link function, which transforms the linear part of the model onto the scale of the response (technically this is the inverse link, but I find it easier to think of it this way: transforming the linear predictor into a response, rather than transforming the response into a linear predictor).
For example, the logistic model uses the sigmoid (inverse logit) function
$$ \hat y = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n))} $$
and Poisson regression uses an exponential function
$$ \hat y = \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n) $$
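As a minimal illustration (my own sketch, with made-up coefficients, not tied to any particular fitted model), the three inverse links above amount to:

```python
import numpy as np

# Hypothetical coefficients and features, purely for illustration.
beta0, beta = 0.5, np.array([1.2, -0.7])
X = np.random.default_rng(0).normal(size=(5, 2))

L = beta0 + X @ beta                 # the linear predictor

y_gaussian = L                       # identity inverse link (linear regression)
y_binomial = 1 / (1 + np.exp(-L))    # sigmoid, the inverse logit
y_poisson = np.exp(L)                # exponential inverse link
```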
To construct an analogy with gradient boosting, we replace the linear part of these models with the sum of the boosted trees. So, for example, the Gaussian case (analogous to linear regression) becomes the well-known
$$ \hat y = \sum_i h_i $$
where $h_i$ is our sequence of weak learners. The binomial case is analogous to logistic regression (as you noted in your answer)
$$ \hat y = \frac{1}{1 + \exp\left(-\sum_i h_i\right)} $$
and Poisson boosting is analogous to Poisson regression
$$ \hat y = \exp\left(\sum_i h_i\right) $$
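In code, the only change from the GLM sketch above is that the linear predictor is replaced by a sum over the weak learners. A hedged sketch, where `trees` is assumed to be a list of already-fitted regression trees:

```python
import numpy as np

def boosted_prediction(trees, X, inverse_link=lambda L: L):
    """Sum the weak learners, then map through the inverse link."""
    L = np.sum([tree.predict(X) for tree in trees], axis=0)
    return inverse_link(L)

# Binomial case:  boosted_prediction(trees, X, lambda L: 1 / (1 + np.exp(-L)))
# Poisson case:   boosted_prediction(trees, X, np.exp)
```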
The question remains: how does one fit these boosted models when a link function is involved? For the Gaussian case, where the link is the identity, the often-heard mantra of fitting weak learners to the residuals of the current working model works out, but this does not really generalize to the more complicated models. The trick is to write the loss function being minimized as a function of the linear part of the model (i.e. the $\sum_i \beta_i x_i$ part of the GLM formulation).
For example, the binomial loss (the negative log-likelihood) is usually encountered as
$$ -\sum_i \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right] $$
Here, the loss is a function of $p_i$, the predicted values on the same scale as the response, and $p_i$ is a non-linear transformation of the linear predictor $L_i$. Instead, we can re-express the loss as a function of $L_i$ itself (in this case also known as the log-odds):
$$ -\sum_i \left[ y_i L_i - \log(1 + \exp(L_i)) \right] $$
Then we can take the gradient of this with respect to $L$, and boost to directly minimize this quantity.
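Carrying out that differentiation for the binomial case makes the recipe concrete. With $p_i = 1 / (1 + \exp(-L_i))$, the gradient of the loss above with respect to $L_i$ is
$$ \frac{\partial}{\partial L_i} \left[ \log(1 + \exp(L_i)) - y_i L_i \right] = p_i - y_i $$
so each boosting stage fits a weak learner to the negative gradient $y_i - p_i$, which recovers the familiar "fit to the residuals" picture, just on the linear scale.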
Only at the very end, when we want to produce predictions for the user, do we apply the link function to the final sequence of weak learners to put the predictions on the same scale as the response. While fitting the model, we internally work on the linear scale the entire time.
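To make this concrete, here is a hedged end-to-end sketch of binomial boosting on the linear (log-odds) scale, using scikit-learn regression trees as the weak learners. This is my own minimal illustration, not any library's actual implementation: a real implementation would also fit per-leaf step sizes, whereas this sketch uses a single shrinkage factor.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(L):
    return 1 / (1 + np.exp(-L))

def fit_binomial_boosting(X, y, n_stages=100, learning_rate=0.1, max_depth=2):
    """Boost 0/1 targets by working on the log-odds scale throughout."""
    L = np.zeros(len(y))              # start at log-odds 0, i.e. p = 0.5
    trees = []
    for _ in range(n_stages):
        p = sigmoid(L)
        residual = y - p              # negative gradient of the loss w.r.t. L
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        L += learning_rate * tree.predict(X)
    return trees

def predict_proba(trees, X, learning_rate=0.1):
    # Only here, at prediction time, is the inverse link applied.
    L = learning_rate * np.sum([t.predict(X) for t in trees], axis=0)
    return sigmoid(L)
```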
Best Answer
I think we must first consider whether the outliers are "true data" or simply noise/corrupted input.
If they are corrupted data (e.g. an adult human weighing 775 kg) then it is perfectly reasonable to exclude these instances from further analysis. If these instances are legitimate data, though, we might want to work with them rather than around them. A first obvious fix that does not involve data transformations would be to employ a custom objective function approximating an MAE, a Huberised loss, or a quantile loss. That would minimise the influence of instances that seem highly unnatural. In general, and without confining yourself to gradient boosting, I would suggest looking into robust statistics to get a better idea of how one would classically deal with potentially noisy and/or skewed data (for example, using a GAM with a scaled-t distribution for the family of the response).
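To make the custom-objective route concrete, here is a sketch of a pseudo-Huber objective (a smooth approximation to the Huber loss) written against xgboost's custom-objective interface; the `delta` threshold is an arbitrary value you would tune:

```python
import numpy as np
import xgboost as xgb

def pseudo_huber_obj(delta=1.0):
    """Pseudo-Huber loss: quadratic near zero, linear in the tails."""
    def obj(preds, dtrain):
        r = preds - dtrain.get_label()       # residuals
        scale = np.sqrt(1 + (r / delta) ** 2)
        grad = r / scale                     # first derivative of the loss
        hess = 1 / scale ** 3                # second derivative of the loss
        return grad, hess
    return obj

# Assuming dtrain is an xgb.DMatrix of your data:
# booster = xgb.train({"max_depth": 3}, dtrain, num_boost_round=200,
#                     obj=pseudo_huber_obj(delta=1.0))
```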
As you say, transforming and then back-transforming your data ($\log(x+1)$ being a common choice for strictly non-negative data) is also a reasonable approach. Go for it, just do not get too crazy: while model interpretability is not a prime concern when predicting, if the transformation is too convoluted (e.g. some arbitrary power transformation), debugging and/or improving an existing model becomes even more complicated than it should be. Finally, I would suggest you look into some data competitions that are concerned with skewed variables themselves (e.g. the Allstate Insurance claims severity predictions); these guys have some nifty ideas too!
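For the transform/back-transform route, a minimal sketch on simulated skewed data (the model and simulation here are arbitrary illustration choices):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = rng.lognormal(mean=X[:, 0], sigma=0.5)   # right-skewed, strictly positive

# Fit on the log(1 + y) scale to tame the right tail...
model = GradientBoostingRegressor().fit(X, np.log1p(y))

# ...and back-transform predictions with the inverse, expm1.
y_pred = np.expm1(model.predict(X))
```

One caveat worth keeping in mind: exponentiating predictions fitted on the log scale estimates something closer to a conditional median than a conditional mean, so a bias correction may be needed if mean predictions matter.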