Say that I am building a linear regression model for predicting some values $y_1, \ldots, y_n$.
If the data contains a few extreme outliers in the response - or even just one - the MSE-fitted equation can be pulled arbitrarily far away from the MAD-fitted one (MAD here meaning mean absolute deviation).
Consider the simplest regression model (just an intercept, $\alpha$), and the following data:
0.0003 0.0001 0.0002 0.0004 50000 0.0002 0.0004 0.0003 0.0001 0.0003
The MAD solution is $\alpha = 0.0003$ (the median). The MSE solution is $\alpha = 5000.00023$ (the mean).
The MAD of the minimum-MAD solution is about 5000 (it is dominated by the single outlier), while the MAD of the minimum-MSE solution is about 9000 - nearly twice as large. More to the point, the MSE fit has been dragged roughly 5000 units away from nine of the ten observations. You can do very badly if you use MSE when the criterion you actually care about is MAD.
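A quick numerical check of the example above (a minimal Python sketch; it only uses the ten values listed):

```python
import numpy as np

y = np.array([0.0003, 0.0001, 0.0002, 0.0004, 50000,
              0.0002, 0.0004, 0.0003, 0.0001, 0.0003])

alpha_mad = np.median(y)  # the intercept minimizing mean absolute deviation
alpha_mse = np.mean(y)    # the intercept minimizing mean squared error


def mad(alpha):
    """Mean absolute deviation of the data around a candidate intercept."""
    return np.mean(np.abs(y - alpha))


print("MAD solution:", alpha_mad)                  # 0.0003
print("MSE solution:", alpha_mse)                  # 5000.00023
print("MAD at the MAD solution:", mad(alpha_mad))  # ~5000, dominated by the outlier
print("MAD at the MSE solution:", mad(alpha_mse))  # ~9000
```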
We can consider the GLM as having two components: a model for the mean and a model for the variance. This is even more explicit in the quasi-GLM case.
The mean is assumed proportional to the exposure; with a log-link (which is what I presume you have), you could try to adjust for the effect of exposure on the mean either by dividing the data by exposure or by using an offset of log-exposure. Both have the same effect on the mean.
However, depending on the particular distribution that's operating*, they can have different effects on the variance.
*(as well as other drivers like dependence and unmodelled effects)
When you divide by exposure you divide the variance by exposure-squared (this is just a basic variance property - $\text{Var}(\frac{X}{e_i})=\frac{1}{e_i^2} \text{Var}(X)$). Equivalently, scaling by exposure reduces the standard deviation in proportion to the mean (leaving the coefficient of variation constant). This might suit claim amounts but doesn't fit with a quasi-Poisson model for claim counts.
[For example, a model for aggregate claim payments might use a Gamma GLM, which has variance proportional to the mean squared (equivalently, constant coefficient of variation). There, an offset of log-exposure reduces the fitted mean by a factor of the exposure and so - because the variance is proportional to the mean squared - reduces the variance by the square of the exposure. For a Gamma GLM with log link, then, the two approaches are identical; the same is true for other models where the mean is proportional to a scale parameter and the variance is proportional to the square of the mean, including lognormal models, Weibull models and a number of others.]
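Spelling out the algebra for the Gamma case (a brief sketch, writing $e_i$ for exposure, $\phi$ for the dispersion and $x_i^\top\beta$ for the linear predictor): with a log-exposure offset,
$$\mu_i = e_i \exp(x_i^\top\beta), \qquad \text{Var}(Y_i) = \phi\,\mu_i^2,$$
while dividing the data by exposure gives
$$\text{E}\left[\frac{Y_i}{e_i}\right] = \exp(x_i^\top\beta), \qquad \text{Var}\left(\frac{Y_i}{e_i}\right) = \frac{\phi\,\mu_i^2}{e_i^2} = \phi \exp(2 x_i^\top\beta),$$
which is again $\phi$ times the squared mean - the same mean-variance relationship, as claimed above.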
For a quasi-Poisson GLM with log link, the variance in the model is proportional to the mean, not to the mean squared. As such, fitting log-exposure as an offset reduces the fitted variance in the way the model requires - in proportion to the change in the mean. As we saw above, dividing by exposure instead changes it in proportion to the mean squared.
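To see the contrast for counts, here is a small simulation sketch (plain NumPy; the frequency and exposure values are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.3        # assumed claim frequency per unit of exposure
n = 200_000      # simulated policies at each exposure level

for e in (0.25, 1.0, 4.0):
    counts = rng.poisson(lam * e, size=n)  # Poisson counts: mean and variance both lam*e
    rates = counts / e                     # the "divide by exposure" response

    # For the counts, the variance tracks the mean across exposure levels,
    # which is what a quasi-Poisson model with a log-exposure offset assumes.
    # For the rates, the mean is lam at every exposure but the variance is lam/e:
    # dividing by exposure has rescaled the variance by 1/e^2, not 1/e.
    print(f"e={e}: counts mean={counts.mean():.3f} var={counts.var():.3f}  "
          f"rates mean={rates.mean():.3f} var={rates.var():.3f}")
```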
If the quasi-Poisson model were actually the correct model for your counts, then you should certainly use an offset of log-exposure, since it would describe the impact on the variance correctly, as Ben indicated.
However, for claim counts, a quasi-Poisson model is at best a rough approximation.
If you have heterogeneity, a negative binomial would tend to model the variability better, and it doesn't have variance proportional to the mean; however, it too often doesn't really capture the variance effect -- some important drivers of claim frequency may lead to an even stronger relationship with the mean.
Realistically, exposure won't exactly impact the variance in proportion to the mean. Many effects we're aware of will work to make that contribution to the variance increase somewhat faster than the mean does.
For counts, the variance assumption in the quasi-Poisson model will at least sometimes be close to correct; if your model is quasi-Poisson, then you'll certainly get the variance wrong (according to your model) if you divide by exposure.
You can assess whether the variance is well approximated as proportional to the mean at model-fitting time by looking at the usual model diagnostics; one way to do so is sketched below. If it isn't, you shouldn't be using a model that says it is; if it is, then you should handle exposure in the way your model implies.
[Of course, exposure may not impact the variance in the model the same way as the rest of the drivers tend to, but that might be introducing more complexity than you have data to deal with.]
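Here is one rough way to do that check, sketched with statsmodels on made-up placeholder data (substitute your own counts, design matrix and exposures; the variable names are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

# Placeholder data standing in for real claim counts, a single rating factor
# and policy-year exposures.
rng = np.random.default_rng(1)
n = 5_000
exposure = rng.uniform(0.1, 2.0, n)
x = rng.normal(size=n)
X = sm.add_constant(x)
y = rng.poisson(exposure * np.exp(-2.0 + 0.5 * x))

# Quasi-Poisson: Poisson mean structure with the dispersion estimated from the
# Pearson chi-square, and log-exposure as an offset.
fit = sm.GLM(y, X, family=sm.families.Poisson(),
             offset=np.log(exposure)).fit(scale="X2")

# Crude mean-variance diagnostic: bin observations by fitted mean and compare
# the within-bin variance of the response to the within-bin mean.
mu = fit.fittedvalues
edges = np.quantile(mu, np.linspace(0, 1, 11))
bin_idx = np.digitize(mu, edges[1:-1])
for b in range(10):
    yb = y[bin_idx == b]
    print(f"bin {b}: mean={yb.mean():.3f}  var={yb.var():.3f}")

# If the variance grows roughly linearly with the mean, the quasi-Poisson
# variance assumption isn't unreasonable; much faster growth points towards
# a negative binomial or some other variance function.
```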
Best Answer
I used to develop these models professionally for a major casualty insurer, and probably had a part in developing the data for one of the Kaggle competitions you're referencing. So I'm relatively well positioned to answer this question.
The goal of these models is to price insurance contracts. That is, we want to know, for a customer who has purchased an insurance contract, how much our company will pay out in total claim costs for that customer. So let's let $X$ denote all the measurements we have for a single customer we've insured.
There are two possibilities for what happens over the life of the contract:
The insured files no claims. In this case the company pays out nothing. Let's call $F$ the random variable counting the number of claims filed by the insured over the contract period. This is often assumed to be Poisson distributed, as a decent approximation. In the jargon of the industry, this random variable is called the frequency.
The insured files at least one claim. Then, for each claim, a random amount is paid out by our company. Let's denote the amount paid out for the $i$th claim as $S_i$. This is a continuous random variable with a heavy right tail. These are often assumed to be Gamma distributed, because the shape is intuitively reasonable. In the jargon of the industry, these are called the severity.
Putting that all together, the amount paid out over the insurance contract is a random variable:
$$Y \mid X = \sum_{i=1}^{F} S_i$$
This is a funny little equation, but basically there is a random number of summands, determined by the frequency $F$, and each summand $S_i$ is a random claim amount (for a single claim).
If $F$ is Poisson and each $S_i$ is Gamma distributed, this compound sum has a Tweedie distribution. Reasonable assumptions thus lead to a parametric assumption that $Y \mid X$ is Tweedie distributed.
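To make the construction concrete, here is a small simulation sketch of the compound Poisson-Gamma payout; the frequency and severity parameters are purely illustrative, not taken from any real book of business:

```python
import numpy as np

rng = np.random.default_rng(0)
n_policies = 50_000
lam = 0.1                  # expected number of claims per policy (frequency)
shape, scale = 2.0, 500.0  # Gamma severity parameters

claim_counts = rng.poisson(lam, size=n_policies)              # F for each policy
total_paid = np.array([rng.gamma(shape, scale, size=f).sum()  # sum of the S_i
                       for f in claim_counts])

print("share of policies with zero payout:", np.mean(total_paid == 0))
print("mean payout:", total_paid.mean())
print("99th percentile payout:", np.quantile(total_paid, 0.99))
# The simulated payouts have a large point mass at zero and a heavy right
# tail -- exactly the features the Tweedie family is used to capture.
```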
As noted above, sort of. It's actually the conditional distribution of the response variable (so $Y \mid X$, not the marginal $Y$), which we never really observe. Some features of the conditional distributions manifest in the marginal, like the large point mass at zero.
Nope. It's the conditional distribution $Y \mid X$ that guides the choice of loss function, and that often comes from thought and imagination, like the above. The (marginal) distribution of $Y$ can be skewed even when the conditional distributions $Y \mid X$ are symmetric. For example:
$$ X \sim \text{Poisson}(\lambda = 1.0) $$ $$ Y \mid X \sim \text{Normal}(\mu = X, \sigma = 1.0) $$
will lead to a right-skewed marginal distribution of $Y$, yet the least-squares loss is exactly the correct one to use.
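A quick simulation check of this example (an illustrative Python sketch):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.poisson(1.0, size=200_000)
y = rng.normal(loc=x, scale=1.0)

print("marginal skewness of Y:", skew(y))             # clearly positive
print("skewness of Y given X = 1:", skew(y[x == 1]))  # approximately zero
# The marginal Y is right-skewed because the Poisson X is, yet each
# conditional Y | X = x is exactly Normal, so squared-error loss
# (modelling the conditional mean) is the right choice here.
```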
I haven't done any projects in this area, but that sounds like a reasonable approach.
There's no magic here, and there's no principled theory of claims distributions. Roughly, the Gamma has the right shape: a Gamma variable $G$ is positively supported (i.e. $P(G \leq 0) = 0$), unimodal, and positively skewed; and it leads to mathematically tractable models. That's about it - it's just a reasonable choice that has worked well for a long time.