You are right, this is a significant problem in machine learning/statistical modelling. Essentially the only way to really solve it is to retain an independent test set, keep it held out until the study is complete, and use it only for the final validation.
However, inevitably people will look at the results on the test set and then change their model accordingly; this won't necessarily result in an improvement in generalisation performance, as the difference in performance between models may be largely due to the particular sample of test data we happen to have. In that case, in making a choice we are effectively over-fitting the test error.
The way to limit this is to make the variance of the test error as small as possible (i.e. the variability in test error we would see if we used different samples of data as the test set, drawn from the same underlying distribution). This is most easily achieved by using a large test set where possible, or by e.g. bootstrapping or cross-validation if there isn't much data available.
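As a rough illustration of gauging that variability, here is a minimal sketch (not from my original work; the data, model and libraries are just placeholders) that bootstrap-resamples a held-out test set to see how much the test error estimate would move around with a different test sample:

```python
# Sketch: estimate the variability of a test-set error by bootstrap-resampling
# the held-out test set (synthetic data and a simple model, purely illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=200, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
errors = (model.predict(X_te) != y_te).astype(float)   # per-example 0/1 losses

rng = np.random.default_rng(0)
boot = [errors[rng.integers(0, len(errors), len(errors))].mean()
        for _ in range(2000)]
print(f"test error: {errors.mean():.3f} (bootstrap SD: {np.std(boot):.3f})")
```

If the bootstrap standard deviation is of the same order as the differences between the models you are comparing, the comparison tells you very little.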
I have found that this sort of over-fitting in model selection is a lot more troublesome than is generally appreciated, especially with regard to performance estimation, see
G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection and subsequent selection bias in performance evaluation", Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.
This sort of problem especially affects the use of benchmark datasets, which have been used in many studies; each new study is implicitly affected by the results of earlier studies, so the observed performance is likely to be an over-optimistic estimate of the true performance of the method. The way I try to get around this is to look at many datasets (so the method isn't tuned to one specific dataset) and also to use multiple random test/training splits for performance estimation (to reduce the variance of the estimate), as sketched below. However, the results still need the caveat that these benchmarks have been over-fitted.
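A minimal sketch of the repeated-splits idea (again with placeholder data and a placeholder model; scikit-learn's ShuffleSplit is one way to do it):

```python
# Sketch: reduce the variance of a performance estimate by averaging over
# many random train/test splits (illustrative data and model only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=1)
splits = ShuffleSplit(n_splits=30, test_size=0.2, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=splits)
print(f"mean accuracy {scores.mean():.3f} +/- {scores.std():.3f} over 30 splits")
```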
Another example where this occurs is in machine learning competitions with a leaderboard based on a validation set. Inevitably some competitors keep tinkering with their model to climb the leaderboard, but then end up towards the bottom of the final rankings. The reason for this is that their multiple choices have over-fitted the validation set (effectively learning the random variations in the small validation set).
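You can see this effect in a toy simulation (all numbers made up): generate many "submissions" that are pure noise, pick the one that scores best on a small validation set, and then score it on a large fresh test set.

```python
# Toy simulation of leaderboard over-fitting: the best of many random
# submissions looks good on a small validation set, but a fresh test set
# reveals chance-level performance (all values illustrative).
import numpy as np

rng = np.random.default_rng(42)
n_val, n_test, n_submissions = 100, 100_000, 500

y_val = rng.integers(0, 2, n_val)        # small public leaderboard set
y_test = rng.integers(0, 2, n_test)      # large private test set

val_scores, test_scores = [], []
for _ in range(n_submissions):
    # each submission guesses labels at random, i.e. there is no real signal
    val_scores.append((rng.integers(0, 2, n_val) == y_val).mean())
    test_scores.append((rng.integers(0, 2, n_test) == y_test).mean())

best = int(np.argmax(val_scores))
print(f"best validation accuracy: {val_scores[best]:.3f}")   # well above 0.5
print(f"same submission on test:  {test_scores[best]:.3f}")  # about 0.5
```

The gap between the two numbers is exactly the optimistic bias introduced by selecting on the validation set.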
If you can't keep a statistically pure test set, then I'm afraid the two best options are (i) collect some new data to make a new statistically pure test set or (ii) make the caveat that the new model was based on a choice made after observing the test set error, so the performance estimate is likely to have an optimistic bias.
What you describe is, in its simplest form, model selection: you explore different ML models or hyper-parameter configurations and pick the best one according to your success criterion. Here you can also use cross-validation, which is more accurate and statistically more reliable than a single validation set. Beyond your question, note that when comparing classification algorithms, accuracy may not be a good choice, for the reasons listed here, especially (but not only) on imbalanced datasets.
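As a hedged sketch of what that looks like in practice (scikit-learn's GridSearchCV with balanced accuracy is just one possible choice of tool and metric; the data and grid are placeholders):

```python
# Sketch: hyper-parameter selection via cross-validation, scored with
# balanced accuracy rather than plain accuracy (illustrative data and grid).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# deliberately imbalanced synthetic data (90% / 10%)
X, y = make_classification(n_samples=400, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

grid = GridSearchCV(SVC(),
                    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                    scoring="balanced_accuracy", cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```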
Not all statistical procedures split the data into training/testing sets, a practice also called "cross-validation" (although the entire procedure involves a little more than that).
Rather, this is a technique that is specifically used to estimate out-of-sample error; i.e. how well will your model predict new outcomes using a new dataset? This becomes a very important issue when you have, for example, a very large number of predictors relative to the number of samples in your dataset. In such cases, it is really easy to build a model with great in-sample error but terrible out-of-sample error (called "over-fitting"). In the cases where you have both a large number of predictors and a large number of samples, cross-validation is a necessary tool to help assess how well the model will behave when predicting on new data. It's also an important tool when choosing between competing predictive models.
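To make the over-fitting point concrete, here is a small illustration (synthetic data with no real signal; the model and library are just one possible choice): in-sample accuracy can be essentially perfect while cross-validated accuracy sits at chance.

```python
# Sketch: with many predictors and few samples, in-sample accuracy can be
# near-perfect while cross-validated accuracy is near chance
# (labels are random, so there is nothing real to learn).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 500))          # 50 samples, 500 predictors
y = rng.integers(0, 2, 50)              # labels unrelated to X

model = LogisticRegression(C=100, max_iter=5000)
print("in-sample accuracy:", model.fit(X, y).score(X, y))                      # close to 1.0
print("cross-validated accuracy:", cross_val_score(model, X, y, cv=5).mean())  # about 0.5
```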
On another note, cross-validation is almost always just used when trying to build a predictive model. In general, it is not very helpful for models when you are trying to estimate the effect of some treatment. For example, if you are comparing the distribution of tensile strength between materials A and B ("treatment" being material type), cross validation will not be necessary; while we do hope that our estimate of treatment effect generalizes out of sample, for most problems classic statistical theory can answer this (i.e. "standard errors" of estimates) more precisely than cross-validation. Unfortunately, classical statistical methodology1 for standard errors doesn't hold up in the case of overfitting. Cross-validation often does much better in that case.
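For contrast, the classical route for that kind of treatment question might look like the following (made-up measurements; a two-sample t-test with scipy is just one standard option), with no cross-validation anywhere:

```python
# Sketch of the classical approach: compare tensile strength of materials
# A and B with a standard error and a two-sample t-test (made-up numbers).
import numpy as np
from scipy import stats

a = np.array([52.1, 49.8, 51.4, 50.9, 53.0, 48.7])   # material A
b = np.array([47.9, 46.5, 48.8, 47.2, 49.1, 46.0])   # material B

t, p = stats.ttest_ind(a, b, equal_var=False)
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
print(f"mean difference {a.mean() - b.mean():.2f}, SE {se:.2f}, t={t:.2f}, p={p:.3f}")
```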
On the other hand, if you are trying to predict when a material will break based on 10,000 measured variables that you throw into some machine learning model based on 100,000 observations, you'll have a lot of trouble building a great model without cross validation!
I'm guessing that in a lot of physics experiments you are generally interested in estimating effects. In those cases, there is very little need for cross-validation.
1One could argue that Bayesian methods with informative priors are a classical statistical methodology that addresses overfitting. But that's another discussion.
Side note: while cross-validation first appeared in the statistics literature, and is definitely used by people who call themselves statisticians, it has become a fundamental, required tool in the machine learning community. Lots of statistical models will work well without it, but almost all models that are considered "machine learning predictive models" need cross-validation, as they often require the selection of tuning parameters, which is almost impossible to do sensibly otherwise.