I'm not sure how to interpret the diagram, but for asymmetric/messy problems the bootstrap is often your friend. Suppose you want a confidence interval on the difference between the $R^2$ values from two different models fit to two different or overlapping datasets, where the numbers of independent experimental units in the two datasets are $n_{1}$ and $n_{2}$, with $n_{u}$ unique experimental units in the union of the two samples. You could sample with replacement from the $n_{u}$ units 1000 times, each time recreating dataset 1 based on which of its units were selected and how often, and likewise for dataset 2. For each resample, fit both models, compute the two $R^2$ values, and take their difference. Then form a bootstrap confidence interval for the difference from the 1000 estimated differences.
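As a sketch of that resampling scheme in Python, with hypothetical fit-and-score callables `r2_model1` and `r2_model2` that take the resampled unit indices for their dataset and return an $R^2$:

```python
import numpy as np

def bootstrap_r2_diff(in_ds1, in_ds2, r2_model1, r2_model2,
                      n_boot=1000, alpha=0.05, seed=None):
    """Percentile bootstrap CI for the difference between two R^2 values.

    in_ds1, in_ds2 : boolean masks over the n_u unique units, True where
                     a unit belongs to dataset 1 / dataset 2 (may overlap)
    r2_model1/2    : hypothetical callables taking resampled unit indices
                     and returning that model's R^2 on the rebuilt dataset
    """
    rng = np.random.default_rng(seed)
    n_u = len(in_ds1)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Sample the unique units with replacement; a unit drawn twice
        # appears twice in whichever rebuilt dataset(s) it belongs to.
        draw = rng.integers(0, n_u, size=n_u)
        diffs[b] = (r2_model1(draw[in_ds1[draw]])
                    - r2_model2(draw[in_ds2[draw]]))
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi, diffs
```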
I don't think you can accomplish exactly what you want with the set of KNN models based on different distance metrics on your single data set, but you can evaluate the relative performance of the modeling approaches built on those metrics. You will, however, have to make two adjustments.
Much of what follows is informed by the discussion on this page.
First, you should evaluate performance with a proper scoring rule like the Brier score instead of accuracy, specificity, sensitivity, or the F1 score. Those are notoriously poor measures for comparing models, and they make implicit assumptions about the cost tradeoffs between different types of classification errors.* The Brier score is effectively the mean squared error between the predicted probabilities of class membership and actual membership. You will have to see how your KNN software provides access to the class probabilities, but this is typically possible, as in this sklearn function.
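For a binary problem, a minimal sklearn sketch (with synthetic data standing in for yours):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import brier_score_loss

# Synthetic stand-in for your data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

knn = KNeighborsClassifier(n_neighbors=10, metric="euclidean")
knn.fit(X, y)

# predict_proba exposes class-membership probabilities; for a binary
# outcome the Brier score is the mean squared error between the
# predicted probability of the positive class and the 0/1 labels.
p_hat = knn.predict_proba(X)[:, 1]
print(brier_score_loss(y, p_hat))
```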
Second, instead of simply fitting the model one time to your data, you need to see how well the modeling process works in repeated application to your data set. One way to proceed is to work with multiple bootstrap samples, say a few hundred to a thousand, of the data. For each bootstrap sample as a training set, build KNN models with each of your distance metrics, then evaluate their performance on the entire original data set as the test set. The distributions of Brier scores over the few hundred to a thousand bootstraps can then indicate whether the models based on the different distance metrics differ meaningfully in terms of that proper scoring rule.
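A sketch of that bootstrap comparison, again on synthetic data and with an assumed list of candidate metrics:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

metrics = ["euclidean", "manhattan", "chebyshev"]  # candidate distance metrics
n_boot = 500  # a few hundred to a thousand
scores = {m: [] for m in metrics}

for _ in range(n_boot):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample as training set
    for m in metrics:
        knn = KNeighborsClassifier(n_neighbors=10, metric=m).fit(X[idx], y[idx])
        # Evaluate on the entire original data set as the test set.
        p_hat = knn.predict_proba(X)[:, 1]
        scores[m].append(brier_score_loss(y, p_hat))

# Compare the distributions of Brier scores across metrics.
for m in metrics:
    print(m, np.mean(scores[m]), np.std(scores[m]))
```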
Even this approach has its limits, however; see this answer by cbeleites for further discussion.
*Using accuracy (fraction of cases correctly assigned) as the measure of model performance makes an implicit assumption that false negatives and false positives have the same importance. See this page for further discussion. In practical applications this assumption can be unhelpful. One example is the overdiagnosis and overtreatment of prostate cancer: false positives in the usual diagnostic tests have led many men who were unlikely to die from this cancer to undergo life-altering therapies with frequently undesirable side effects.
The F1 score does not take true negatives into account at all, which might be critical in some applications. Sensitivity and specificity values depend on a particular choice of tradeoff between them. Sometimes that tradeoff is made silently by software, for example by setting the classification cutoff in logistic regression at a predicted probability of $p>0.5$. Because of the explicit or hidden assumptions underlying all of these measures, small changes in those assumptions can change the measures dramatically.
The most generally useful approach is to produce a good model of class membership probabilities, then use judgements about the costs of tradeoffs to inform final assignments of predicted classes (if needed). The Brier score and other proper scoring rules provide continuous measures of the quality of a probability model that are optimized when the model is the true model.
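As a toy illustration of that two-step approach (the cost numbers are hypothetical, and the cutoff formula is the standard expected-cost-minimizing threshold, not something specific to this answer):

```python
# Hypothetical costs: a false negative is 5x as costly as a false positive,
# so the cost-minimizing cutoff c_fp / (c_fp + c_fn) is well below 0.5.
c_fp, c_fn = 1.0, 5.0
cutoff = c_fp / (c_fp + c_fn)  # = 1/6 here

# p_hat: predicted class-membership probabilities from any probability
# model (e.g. the KNN sketch above); classes are assigned only at this
# final step, after the cost judgment.
y_pred = (p_hat > cutoff).astype(int)
```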
Best Answer
One of the linked posts above alludes to using a likelihood ratio test, although your models have to be nested in one another for this to work (i.e. all the parameters in one of the models must be present in the model you are testing it against).
RMSE is clearly a measure of how well the model fits the data. However, so is the likelihood. The likelihood for a given person, say Mrs. Chen, is the probability that a person with all her covariate values had the outcome she had. The joint likelihood of the dataset is Mrs. Chen's likelihood * Mrs. Gundersen's likelihood * Mrs. Johnson's likelihood * ... etc., i.e. $L(\theta) = \prod_i \Pr(y_i \mid x_i, \theta)$.
Adding a covariate, or any number of covariates, can't make the maximized likelihood worse; it can only improve it, perhaps by a non-significant amount. Models that fit better have a higher likelihood. You can formally test whether model A fits the data better than model B. You should have some sort of LR test function available in whatever software you use, but basically the LR test statistic is $-2(\log L_{\text{reduced}} - \log L_{\text{full}})$, and it's distributed chi-square with degrees of freedom equal to the difference in the number of parameters.
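As a minimal sketch of that test with statsmodels on synthetic data (a built-in LR test function in your software would serve the same purpose):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Synthetic data standing in for your own.
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 2 * x1 + 0.5 * x2 + rng.normal(size=n)

# Nested models: the reduced model's parameters are a subset of the full model's.
reduced = sm.OLS(y, sm.add_constant(np.column_stack([x1]))).fit()
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# LR statistic: -2 * (logL_reduced - logL_full), chi-square with
# df = difference in the number of parameters (here 1).
lr = -2 * (reduced.llf - full.llf)
df = full.df_model - reduced.df_model
print(lr, stats.chi2.sf(lr, df))
```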
Also, comparing the AIC or BIC of the two models and choosing the one with the lower value is acceptable. AIC and BIC are essentially the log likelihood penalized for the number of parameters: $\mathrm{AIC} = -2\log L + 2k$ and $\mathrm{BIC} = -2\log L + k\log n$, where $k$ is the number of parameters and $n$ is the sample size.
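Continuing the sketch above, statsmodels exposes both criteria directly on a fitted model:

```python
# Lower is better for both criteria; they penalize the extra parameter
# in the full model by 2 (AIC) or log(n) (BIC).
print(reduced.aic, full.aic)
print(reduced.bic, full.bic)
```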
I'm not sure about using a t-test for the RMSEs, and I would actually lean against it unless you can find some theoretical work that's been done in the area. Basically, do you know how the values of RMSE are asymptotically distributed? I'm not sure. Some further discussion here:
http://www.stata.com/statalist/archive/2012-11/index.html#01017