This is a good question that unfortunately went unanswered for a long time. It seems a partial answer was given here just a couple of months after the question was asked, arguing essentially that correlation is useful when the outputs are very noisy, and MSE otherwise. I think we should first look at the formulas for both.
$$MSE(y,\hat{y}) = \frac{1}{n} \sum_{i=1}^n(y_i - \hat{y_i})^2$$
$$R(y, \hat{y}) = \frac{\sum_{i=1}^n (y_i - \bar{y})(\hat{y_i} - \hat{\bar{y}})}
{\sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2} \sqrt{\sum ^n _{i=1}(\hat{y_i} - \hat{\bar{y}})^2}} $$
A few things to note. In the case of linear regression we know that $\hat{\bar{y}} = \bar{y}$ because of the unbiasedness of the regressor, so the formula simplifies a little, but in general we can't make this assumption about ML algorithms. Perhaps more broadly, it is interesting to think of the scatter plot in $\mathbb{R}^2$ of the pairs $\{ (y_i, \hat{y_i})\}$: correlation tells us how strong the linear relationship between the two is in this plot, while MSE tells us how far each point is from the diagonal. Looking at the counterexamples on the Wikipedia page, you can see there are many relationships between the two that won't be captured.
I think correlation generally tells us similar things to $R^2$, but with directionality, so correlation is somewhat more descriptive in that sense. Under another interpretation, $R^2$ doesn't rely on the linearity assumption and merely tells us the percentage of variation in $y$ that's explained by our model. In other words, it compares the model's predictions to the naive strategy of predicting the mean for every point. The formula for $R^2$ is:
$$R^2(y,\hat{y}) = 1 - \frac{\sum_{i=1}^n (y_i-\hat{y_i})^2}{\sum_{i=1}^n (y_i-\bar{y})^2}$$
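To make the three formulas above concrete, here is a minimal sketch computing MSE, Pearson correlation, and $R^2$ on a small made-up set of targets and predictions (the data values are purely illustrative):

```python
import numpy as np

def mse(y, y_hat):
    # Mean squared error: average squared distance from the diagonal y = y_hat
    return np.mean((y - y_hat) ** 2)

def pearson_r(y, y_hat):
    # Pearson correlation between targets and predictions
    return np.corrcoef(y, y_hat)[0, 1]

def r_squared(y, y_hat):
    # R^2: one minus residual sum of squares over total sum of squares
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Illustrative data: predictions close to, but not exactly on, the diagonal
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

print(mse(y, y_hat))        # 0.025
print(pearson_r(y, y_hat))  # close to 1
print(r_squared(y, y_hat))  # 0.98
```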
So how does $R$ compare to $R^2$? Well, it turns out that $R$ is more immune to scaling of one of the inputs. This has to do with the fact that $R^2$ is homogeneous of degree 0 only jointly in both inputs, whereas $R$ is homogeneous of degree 0 in either input separately. It's a little less clear what this implies in terms of machine learning, but it might mean that the model class of $\hat{y}$ can be a bit more flexible under correlation. Under some additional assumptions, however, the two measures are equal, and you can read more about it here: http://www.win-vector.com/blog/2011/11/correlation-and-r-squared/.
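The scaling point can be checked numerically: multiplying only the predictions by a constant leaves correlation untouched but destroys $R^2$ (here taken as the $1 - \mathrm{SSE}/\mathrm{SST}$ version; the data are again illustrative):

```python
import numpy as np

def r_squared(y, y_hat):
    # 1 - SSE/SST version of R^2
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

r_before = np.corrcoef(y, y_hat)[0, 1]
r_after = np.corrcoef(y, 10 * y_hat)[0, 1]   # scale only the predictions

r2_before = r_squared(y, y_hat)
r2_after = r_squared(y, 10 * y_hat)

print(r_before, r_after)    # identical: R is invariant to scaling one input
print(r2_before, r2_after)  # R^2 goes from near 1 to strongly negative
```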
Finally, a last important thing to note is that $R$ and $R^2$ do not measure goodness of fit around the $y=\hat{y}$ line. It is possible (although odd) for a predictor to be linearly shifted away from the $y=\hat{y}$ line and still have an $R^2$ of one, yet the predictions would be "bad". In this case, MSE would be more informative than $R^2$ in finding the better predictor, but perhaps this is more of a pathological case than an issue with using $R$ and $R^2$ as metrics.
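The shifted-predictor pathology is easy to reproduce: a constant offset of 100 (an arbitrary choice for illustration) gives a perfect correlation while MSE exposes the problem:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = y + 100.0  # every prediction shifted off the y = y_hat line

r = np.corrcoef(y, y_hat)[0, 1]   # exactly 1: a perfect linear relationship
mse = np.mean((y - y_hat) ** 2)   # 10000: MSE sees that every prediction is far off

print(r, mse)
```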
The Poisson distribution is one integer-valued distribution among many alternatives. You can experiment with alternative losses.
The prediction of a Poisson model is the conditional expectation, so there's no reason for it to be restricted to integers in general. To see why, consider that the average of $(1,2,3,4)$ is not an integer, even though each of the elements is. Naturally, you can do things like rounding to obtain integers; whether or not that's the best choice depends on your goals and how you define "best."
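The averaging point can be sketched with an intercept-only Poisson model, where the maximum-likelihood estimate of the mean is just the sample average:

```python
import numpy as np

counts = np.array([1, 2, 3, 4])  # integer-valued observations

# For an intercept-only Poisson model, the MLE of the rate is the sample
# mean, which need not be an integer even though every observation is one.
lam_hat = counts.mean()
print(lam_hat)  # 2.5

# Rounding recovers an integer if one is required, at the cost of changing
# the prediction; whether that matters depends on how predictions are scored.
print(round(lam_hat))
```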
The metric isn't the reason that you have poor fit, it's just a thing that measures how good the fit is. A better model will improve the fit (tautologically).
Best Answer
If the data contain a few extreme outliers in the response - or even just one - the MSE-fitted equation can be pulled arbitrarily far away from the MAE-fitted one.
Consider the simplest regression model (just an intercept, $\alpha$) and the following data:
The MAD solution is $\alpha = 0.0003$; the MSE solution is $\alpha = 5000.00023$. The MAD of the minimum-MAD solution is about 0.0001, while the MAD of the minimum-MSE solution is about 5000. You can do very badly if you use MSE when the criterion is MAD.
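Since the original data aren't shown, here is a hypothetical reconstruction in the same spirit: four tiny responses plus one extreme outlier. For an intercept-only model the least-absolute-deviations fit is the median and the least-squares fit is the mean, and comparing the median absolute deviation (MAD) of the residuals around each shows the effect:

```python
import numpy as np

# Hypothetical data (the answer's actual data aren't shown):
# four tiny responses plus one extreme outlier.
y = np.array([0.0001, 0.0002, 0.0003, 0.0004, 25000.0])

alpha_mad = np.median(y)  # least-absolute-deviations fit -> the median
alpha_mse = np.mean(y)    # least-squares fit -> the mean, dragged by the outlier

def mad(y, alpha):
    # Median absolute deviation of the residuals around alpha
    return np.median(np.abs(y - alpha))

print(alpha_mad, alpha_mse)  # 0.0003 vs 5000.0002
print(mad(y, alpha_mad))     # 0.0001
print(mad(y, alpha_mse))     # roughly 5000
```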