Mean Square Error (MSE) is used in regression problems to compute the prediction error. Large errors have a large influence on the MSE, while small errors have an almost negligible influence. So if an observation produces a large error, there is a possibility that the observation is an outlier, and MSE will work very hard to minimize that error. So, does minimizing MSE cause underfitting?
Can Mean Square Error cause underfitting?
mse · optimization · regression
Related Solutions
You could probably log-transform (or apply any other scale transform to) the target variable and then use RMSE. That might reduce the impact of large outliers.
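A minimal sketch of this idea, using made-up numbers with one large outlier: the RMSE on the raw scale is dominated by the outlier, while the RMSE on log-transformed targets is not.

```python
import numpy as np

# Hypothetical data: one observation is a large outlier
y_true = np.array([10.0, 12.0, 9.0, 11.0, 1000.0])
y_pred = np.array([11.0, 11.0, 10.0, 10.0, 500.0])

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

# Raw-scale RMSE is dominated by the single outlier
raw_rmse = rmse(y_true, y_pred)            # > 200

# log1p compresses the scale, shrinking the outlier's influence
log_rmse = rmse(np.log1p(y_true), np.log1p(y_pred))  # < 1

print(raw_rmse, log_rmse)
```

Note, though, that evaluating on the log scale changes which functional you are eliciting, which is exactly the point of the next answer.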
It depends on what functional of the future distribution you want to elicit.
Put differently, future outcomes follow some probability distribution (which, judging from your description, may be heavy-tailed and/or zero-inflated), and the point forecast you want to evaluate is a "one number summary" of this distribution. This holds even if you do not explicitly look at the distribution - it will always be there and lurking under the surface.
The issue is that different error measures elicit different one number summaries from the underlying distribution. The MSE is minimized in expectation by the expectation of the distribution. The MAE is minimized by its median. (That the MSE is more strongly influenced by the tail of the distribution than the MAE is just another way of saying that the expectation of the distribution is more strongly influenced by the tail than the median.) A quantile loss will be optimized by the appropriate quantile.
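This can be checked numerically with a small grid search on a made-up skewed sample: the constant prediction that minimizes the MSE lands on the mean, while the one that minimizes the MAE lands on the median.

```python
import numpy as np

# Skewed sample: the mean (21.6) and the median (2.0) differ a lot
x = np.array([1.0, 2.0, 2.0, 3.0, 100.0])

# Evaluate every candidate constant forecast on a fine grid
candidates = np.linspace(0.0, 100.0, 100001)
mse = ((x[:, None] - candidates[None, :]) ** 2).mean(axis=0)
mae = np.abs(x[:, None] - candidates[None, :]).mean(axis=0)

best_mse = candidates[mse.argmin()]  # ~ x.mean()
best_mae = candidates[mae.argmin()]  # ~ np.median(x)

print(best_mse, x.mean())
print(best_mae, np.median(x))
```

The tail observation (100.0) drags the MSE-optimal forecast far above the MAE-optimal one, which is the sensitivity discussed above, restated in terms of the elicited functional.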
One consequence is that different point forecasts will be optimal for different error measures. Another is that your OLS regression will likely optimize the MSE as its objective function, so it does not really make sense to evaluate forecasts from an OLS model using the MAPE. (The MAE makes sense if you believe in symmetric errors, which again does not seem to be the case here.)
So the question should first be what functional you are interested in, and only after you have given this some thought should you pick an appropriate error measure. Which functional solves your problem, in turn, depends on what you want to do with the point forecast afterwards.
More information can be found at What are the shortcomings of the Mean Absolute Percentage Error (MAPE)?, at Why use a certain measure of forecast error (e.g. MAD) as opposed to another (e.g. MSE)? and in Kolassa (2020).
Best Answer
By itself, no. The choice of loss function depends on your data and the nature of the problem you are trying to solve. As you noticed, mean square error is sensitive to large errors, while something like mean absolute error is less so. Such sensitivity is sometimes a desirable property, while in other cases you need a robust loss function that is insensitive to outliers. You use squared error when you need that sensitivity.
No matter what loss you choose, overfitting is a property of the whole model, not only of the loss. For example, a simple regression model that minimizes the squared error has no chance to overfit, because it is not expressive enough. On the other hand, something like $k$NN can overfit regardless of the loss. Basically, if the model can drag the training error to zero (e.g. a model with enough parameters to memorize the data), it eventually will, no matter what the loss is.
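A quick illustration of that last point, with a hand-rolled 1-nearest-neighbour regressor (an assumed toy implementation, not any particular library's): on its own training set, every prediction is the point's own target, so the training error is exactly zero under MSE, MAE, or any other loss.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # 50 training points, 3 features
y = rng.normal(size=50)        # arbitrary continuous targets

def knn1_predict(X_train, y_train, X_query):
    # For each query point, return the target of the nearest
    # training point (squared Euclidean distance)
    d = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

# Predicting on the training set itself: each point is its own
# nearest neighbour, so the model has memorized the data
pred = knn1_predict(X, y, X)
train_mse = np.mean((pred - y) ** 2)   # exactly 0
train_mae = np.mean(np.abs(pred - y))  # exactly 0
print(train_mse, train_mae)
```

The zero training error says nothing about generalization; that is precisely why overfitting is about the model's capacity, not about which loss was plugged in.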