Lots of metrics exist, and no single one is generally the best to use; it depends on your problem and on your data. Often, several metrics can be used. I find it useful to compute both hypothesis tests and different metrics (RMSE, MAPE, ...) and to check whether they give similar results, so that your conclusions are not based on a single metric.
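For illustration, here is a minimal sketch (in Python, with made-up holdout numbers) of computing a few of these metrics on the same forecasts and checking whether they point in the same direction:

```python
import numpy as np

# Hypothetical holdout actuals and forecasts, purely for illustration
actuals   = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0])
forecasts = np.array([110.0, 121.0, 128.0, 133.0, 124.0, 130.0])

errors = actuals - forecasts

rmse = np.sqrt(np.mean(errors ** 2))                    # Root Mean Squared Error
mae  = np.mean(np.abs(errors))                          # Mean Absolute Error
mape = np.mean(np.abs(errors) / np.abs(actuals)) * 100  # Mean Absolute Percentage Error (%)

print(f"RMSE: {rmse:.2f}, MAE: {mae:.2f}, MAPE: {mape:.2f}%")
```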
In the linked blog post, Rob Hyndman calls for entries to a tourism forecasting competition. Essentially, the blog post serves to draw attention to the relevant IJF article, an ungated version of which is linked to in the blog post.
The benchmarks you refer to - 1.38 for monthly, 1.43 for quarterly and 2.28 for yearly data - were apparently arrived at as follows. The authors (all of them expert forecasters and very active in the IIF - no snake oil salesmen here) are quite capable of applying standard forecasting algorithms or forecasting software, and they are probably not interested in simple ARIMA submissions. So they went and applied some standard methods to their data. For the winning submission to be invited for a paper in the IJF, they ask that it improve on the best of these standard methods, as measured by the MASE.
So your question essentially boils down to:
Given that a MASE of 1 corresponds to a forecast that is as good out-of-sample (by MAD) as the naive random walk forecast is in-sample, why can't standard forecasting methods like ARIMA improve on 1.38 for monthly data?
Here, the 1.38 MASE comes from Table 4 in the ungated version. It is the average ASE over 1- to 24-month-ahead forecasts from ARIMA. The other standard methods, like ForecastPro, ETS etc., perform even worse.
And here, the answer gets hard. It is always very problematic to judge forecast accuracy without considering the data. One possibility I could think of in this particular case could be accelerating trends. Suppose that you try to forecast $\exp(t)$ with standard methods. None of these will capture the accelerating trend (and this is usually a Good Thing - if your forecasting algorithm often models an accelerating trend, you will likely far overshoot your mark), and they will yield a MASE that is above 1. Other explanations could, as you say, be different structural breaks, e.g., level shifts or external influences like SARS or 9/11, which would not be captured by the non-causal benchmark models, but which could be modeled by dedicated tourism forecasting methods (although using future causals in a holdout sample is a kind of cheating).
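To make the accelerating-trend point concrete, here is a small sketch of my own (not from the competition or the paper): it fits a straight-line trend, as a stand-in for a method that does not capture the acceleration, to an exponential series and computes the MASE of its 24-step-ahead forecasts against the in-sample one-step naive MAE:

```python
import numpy as np

# Accelerating trend: exponential "history" (48 months) and a 24-month holdout
t_train = np.arange(48)
t_test  = np.arange(48, 72)
y_train = np.exp(0.05 * t_train)
y_test  = np.exp(0.05 * t_test)

# Stand-in for a standard method that misses the acceleration: a straight-line trend fit
slope, intercept = np.polyfit(t_train, y_train, 1)
forecasts = intercept + slope * t_test

# MASE: out-of-sample MAE scaled by the in-sample one-step naive (random walk) MAE
mae_out   = np.mean(np.abs(y_test - forecasts))
mae_naive = np.mean(np.abs(np.diff(y_train)))
print("MASE:", mae_out / mae_naive)  # lands well above 1 for this accelerating series
```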
So I'd say that you likely can't say a lot about this without looking at the data themselves. They are available on Kaggle. Your best bet is likely to take these 518 series, hold out the last 24 months, fit ARIMA models, calculate MASEs, dig out the ten or twenty series with the worst MASEs, get a big pot of coffee, look at these series and try to figure out what it is that makes ARIMA models so bad at forecasting them.
EDIT: another point that appears obvious after the fact but took me five days to see - remember that the denominator of the MASE is the MAE of the one-step-ahead in-sample random walk forecasts, whereas the numerator is the average absolute error of the 1- to 24-step-ahead out-of-sample forecasts. It's not too surprising that forecasts deteriorate with increasing horizons, so this may be another reason for a MASE of 1.38. Note that the Seasonal Naive forecast was also included in the benchmark and had an even higher MASE.
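This horizon mismatch alone can push the MASE well above 1. A small simulation sketch (my own, not from the article): even when the data-generating process really is a random walk and we forecast it with the "correct" naive method, the 1-24-step-ahead out-of-sample errors are larger on average than the one-step in-sample errors:

```python
import numpy as np

rng = np.random.default_rng(42)
mases = []
for _ in range(1000):
    # Simulate a pure random walk: 120 months of history plus a 24-month holdout
    y = np.cumsum(rng.normal(size=144))
    train, test = y[:120], y[120:]

    # Naive forecast: the last observed value, held flat over all 24 horizons
    forecasts = np.full(24, train[-1])

    mae_out   = np.mean(np.abs(test - forecasts))  # 1-24-step-ahead out-of-sample errors
    mae_naive = np.mean(np.abs(np.diff(train)))    # one-step-ahead in-sample naive errors
    mases.append(mae_out / mae_naive)

print("average MASE:", np.mean(mases))  # noticeably above 1, despite the "correct" model
```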
Best Answer
I don't think there is a closed-form solution to this question. (I'd be interested in being proven wrong.) I'd assume you will need to simulate. And hope that your predictive posterior is not misspecified too badly.
In case it is interesting: we once wrote a little paper (see also this presentation) that used rolls of standard six-sided dice to explain how minimizing percentage errors can lead to biased forecasts. We also looked at various flavors of the MAPE and wMAPE, but let's concentrate on the sMAPE here.
Here is a plot where we simulate "sales" by rolling $n=8$ six-sided dice $N=1,000$ times and plot the average sMAPE, together with pointwise quantiles:
(Note that I'm using the alternative sMAPE formula which divides the denominator by 2.)
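In case it helps to see the mechanics, here is roughly how such a simulation can be set up (a sketch of my own, not the paper's code), using that sMAPE formulation, $\frac{1}{n}\sum_t |y_t-\hat{y}_t| \big/ \frac{|y_t|+|\hat{y}_t|}{2}$, and a grid of constant candidate forecasts:

```python
import numpy as np

rng = np.random.default_rng(2023)

def smape(actuals, forecast):
    """sMAPE with the (|y| + |yhat|)/2 denominator, as used in the text."""
    return np.mean(np.abs(actuals - forecast) / ((np.abs(actuals) + np.abs(forecast)) / 2))

n, N = 8, 1000                            # n dice per period, N simulated replications
candidates = np.linspace(1, 6, 51)        # grid of constant candidate forecasts

rolls = rng.integers(1, 7, size=(N, n))   # N replications of n die rolls each
smapes = np.array([[smape(r, f) for f in candidates] for r in rolls])

means  = smapes.mean(axis=0)                      # average sMAPE per candidate forecast
lo, hi = np.quantile(smapes, [0.1, 0.9], axis=0)  # pointwise quantile band, as in the plot

print(f"sMAPE-minimal flat forecast (simulated): {candidates[means.argmin()]:.1f}")  # near 4
```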
Something along these lines may help in your case. (Again, you will need to assume that your posterior predictive distribution is "correct enough" to do this kind of simulation - but you would need to assume that for any other approach, too, so this just adds a general caveat, not a specific issue.)
In this simple example of rolling standard six-sided dice, we can actually calculate and plot the expected s(M)APE as a function of the forecast:
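Concretely, for one roll of a fair die the expectation is just an average over six equally likely outcomes, so it can be evaluated on a grid in a few lines (again a sketch, not the original code):

```python
import numpy as np

def expected_sape(forecast):
    """Expected sAPE for one roll of a fair six-sided die, given a point forecast."""
    outcomes = np.arange(1, 7)
    sape = np.abs(outcomes - forecast) / ((outcomes + forecast) / 2)
    return sape.mean()  # each outcome has probability 1/6

grid  = np.linspace(1, 6, 501)
esape = np.array([expected_sape(f) for f in grid])

print(f"EsAPE-minimal forecast: {grid[esape.argmin()]:.2f}")  # 4.00
print("EsAPE at 4 vs. at 3.5:", expected_sape(4.0), expected_sape(3.5))
```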
This agrees rather well with the simulation averages above. And it shows nicely that the EsAPE-minimal forecast for rolling a standard six-sided die is a biased 4, instead of the unbiased expectation of 3.5.
Additional fun fact: if your predictive distribution is a Poisson with a predicted parameter $\hat{\lambda}<1$, then the forecast that minimizes the expected sAPE is $\hat{y}=1$ - independently of the specific value of $\hat{\lambda}$.
At least this is claimed in footnote 1 in Seaman & Bowman (in press, IJF, commentary on the M5 forecasting competition) without a proof. It's quite easy to see that the EsAPE-minimal forecast satisfies $\hat{y}\geq 1$ (you just show that any alternative forecast $\hat{y}'<1$ will lead to a larger EsAPE). Showing that any $\hat{y}'>1$ will lead to a larger EsAPE than $\hat{y}=1$ seems to be a little tedious. However, simulations look reassuring.
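For instance, a simulation along these lines (a sketch of mine; the grid of candidate forecasts is restricted to strictly positive values, so the $0/0$ case never arises) shows the sAPE-minimal constant forecast sitting at 1 for several values of $\hat{\lambda}<1$:

```python
import numpy as np

rng = np.random.default_rng(7)

def mean_sape(actuals, forecast):
    """Average sAPE of a constant positive forecast (an actual of 0 contributes 2)."""
    return np.mean(np.abs(actuals - forecast) / ((np.abs(actuals) + forecast) / 2))

candidates = np.linspace(0.1, 3.0, 30)  # strictly positive candidate forecasts, incl. 1.0
for lam in (0.3, 0.5, 0.9):             # a few predicted Poisson means below 1
    draws = rng.poisson(lam, size=100_000)
    sapes = [mean_sape(draws, f) for f in candidates]
    print(f"lambda={lam}: sAPE-minimal forecast ~ {candidates[np.argmin(sapes)]:.1f}")
```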