As explained in this answer, multiple imputation generates multiple completed datasets, performs the statistical analysis on each of them, and pools the results. Essentially, each imputation takes a simple (deterministic) imputation and adds a random value to it, to restore the randomness lost in the imputation process. Averaging the multiple imputations before doing any statistical analysis therefore removes most of that restored randomness (by averaging) and gives a result close to a simple imputation plus a small random error.
Therefore, there is no advantage to averaging the multiple imputations and reporting that average over simply using and reporting a single imputation.
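The point above can be illustrated with a small numpy sketch (all numbers are made up for illustration): each imputation is modelled as a fixed guess plus fresh noise, and averaging many of them cancels the noise, leaving roughly the simple imputation again.

```python
import numpy as np

rng = np.random.default_rng(0)

simple_imputation = 9.5    # a deterministic guess, e.g. a regression prediction
m = 50                     # number of imputations
noise_sd = 2.0             # randomness added to each imputation

# Multiple imputation: the same deterministic guess plus fresh noise each time
imputations = simple_imputation + rng.normal(0.0, noise_sd, size=m)

# Averaging the imputations before analysis cancels most of the noise:
# the spread of the average is roughly noise_sd / sqrt(m), so the result
# is close to the simple imputation plus a small random error.
averaged = imputations.mean()
```

With `m = 50` imputations, `averaged` lands within a few tenths of `simple_imputation`, even though the individual imputations vary with a standard deviation of about 2.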
Short answer: your gut feeling is right.
Longer answer: The strength of multiple imputation lies in the pooling procedure. If you read the mice manual, the authors go into depth about this. They state that imputation is not a technique you apply to a dataset to fill in its empty cells; rather, it is the combination of setting up a strategy to replace missing data (chained equations, in the case of mice), performing the analysis on each imputed dataset, and subsequently pooling the results to answer your research question (i.e. the reason you performed the analysis in the first place). As such, all of these steps are mandatory.
Now, more specifically to your situation. In the original data, values may be missing selectively, which can lead to bias. Moreover, most analyses require complete data on all variables, so you would need to exclude incomplete cases or handle them in some way. With imputation, you fill in the missing data with 'guesstimates' under the assumption that your data are 'missing at random, conditional on the observed data' (MAR). However, because these are guesses based on your data, you add some randomness and repeat the completion process multiple times in order to create a distribution of guesses.
If you were to analyse these data in the 'long' format you mention, you would essentially have inflated your sample size by a factor equal to the number of imputation sets! That would undoubtedly increase the apparent precision of your estimates, but it is wrong: cases that were complete from the start have been copied, and, more importantly, you have not accounted for the uncertainty in your guesstimates.
The better way, therefore, is to analyse the data per imputation set. That gives you m sets of results. However, you do not know which imputation set is the 'most correct' one (if such a set even exists), so the average coefficient across all models is your best estimate of the 'true' coefficient. For the precision (and for hypothesis testing and confidence intervals) you then need to handle the standard error appropriately; this is where the uncertainty comes into play. Using Rubin's rules, you combine the within-imputation variances and add 'a little extra' to represent the variation of the estimates across imputation sets.
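The pooling step described above can be sketched in a few lines. This is a minimal implementation of Rubin's rules for a single coefficient; the input estimates and standard errors are hypothetical, not from any real analysis.

```python
import numpy as np

def pool_rubin(estimates, std_errors):
    """Pool per-imputation estimates and standard errors via Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    m = len(estimates)

    q_bar = estimates.mean()        # pooled point estimate: average coefficient
    w = (std_errors ** 2).mean()    # within-imputation variance
    b = estimates.var(ddof=1)       # between-imputation variance
    t = w + (1 + 1 / m) * b         # total variance: the 'little extra' on top of w
    return q_bar, np.sqrt(t)

# Hypothetical coefficients and SEs from m = 3 imputed datasets
est, se = pool_rubin([1.0, 1.2, 0.8], [0.5, 0.5, 0.5])
```

Note that the pooled standard error (about 0.55 here) is larger than any of the within-imputation standard errors (0.5): the between-imputation spread of the coefficients is exactly the uncertainty that a long-format analysis would ignore.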
Conclusion
Finally, constructing your confidence intervals and performing your hypothesis tests using these pooling rules usually reduces bias, and biased inferences, compared to a complete-case analysis. Compared to your long-format dataset, the coefficients might be quite similar, but, as your gut feeling rightly told you, the long-format results are far too precise (too-narrow confidence intervals; too-low p-values) relative to what can actually be concluded from these imputation analyses.
Best Answer
I believe that your thinking is right.
The alternative is to perform multiple imputation on the entire dataset prior to splitting into train/test partitions. Doing so would mean that information leaks between the partitions in both directions: test observations influence the imputed values in the training sets, and training observations influence the imputed values in the test sets. This leakage biases the results of cross-validation.
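To make the leakage-free ordering concrete, here is a minimal numpy sketch using plain mean imputation as a stand-in for a full imputation model (the data and split are made up): the imputation is fitted on the training fold only, then applied to both folds.

```python
import numpy as np

# Toy feature with missing values (NaN); values and split are illustrative only.
x = np.array([1.0, 2.0, np.nan, 4.0, 100.0, np.nan])
train_idx = np.array([0, 1, 2, 3])   # split BEFORE any imputation
test_idx = np.array([4, 5])

x_train, x_test = x[train_idx], x[test_idx]

# Fit the imputation model (here just a mean) on the training fold only...
train_mean = np.nanmean(x_train)

# ...then apply it to both folds. The extreme test value 100.0 never
# influences the fill-in value, so nothing leaks into the training fold.
x_train_filled = np.where(np.isnan(x_train), train_mean, x_train)
x_test_filled = np.where(np.isnan(x_test), train_mean, x_test)
```

Had the mean been computed on all six values, the test outlier 100.0 would have pulled every imputed value upward, contaminating the cross-validation estimate.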