Is there an advantage to using higher dimensions (2D, 3D, etc.), or should you just build $x-1$ single-dimension classifiers and aggregate their predictions in some way?
This depends on whether your features are informative or not. Do you suspect that some features will not be useful in your classification task? To gain a better idea of your data, you can also try to compute pairwise correlation or mutual information between the response variable and each of your features.
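As a rough illustration of that screening step, here is a minimal sketch using scikit-learn; the feature matrix, labels, and numbers are synthetic and purely illustrative:

```python
# Screen features by mutual information and correlation with the response.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                               # 5 candidate features
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # only feature 0 matters

mi = mutual_info_classif(X, y, random_state=0)              # MI with the response
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

print("mutual information:", np.round(mi, 3))
print("abs. correlation:  ", np.round(corr, 3))
```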
As a starting point, to combine all (or a subset) of your features, you can try computing the L1 (Manhattan) or L2 (Euclidean) distance between the query point and each 'training' point.
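A minimal sketch of that, again on illustrative synthetic data, comparing the L1 and L2 choices inside a kNN classifier built on all features:

```python
# Compare Manhattan (L1) vs Euclidean (L2) distance in kNN via cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)

for metric in ("manhattan", "euclidean"):            # L1 vs L2 distance
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    acc = cross_val_score(knn, X, y, cv=5).mean()
    print(f"{metric:>9}: mean CV accuracy = {acc:.3f}")
```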
Building all of these classifiers from all potential combinations of the variables would be computationally expensive. How could I optimize this search to find the best kNN classifiers from that set?
This is the problem of feature subset selection. There is a lot of academic work in this area; see Guyon, I., & Elisseeff, A. (2003), "An Introduction to Variable and Feature Selection", Journal of Machine Learning Research, 3, 1157–1182, for a good overview.
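One common shortcut through that search space is greedy forward selection with cross-validated kNN accuracy as the criterion. A minimal sketch using scikit-learn's SequentialFeatureSelector, with illustrative data and parameters:

```python
# Greedy forward feature selection wrapped around a kNN classifier.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] - X[:, 2] + 0.5 * rng.normal(size=200) > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=5)
selector = SequentialFeatureSelector(knn, n_features_to_select=2,
                                     direction="forward", cv=5)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```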
And, once I find a series of classifiers, what's the best way to combine their output into a single prediction?
This will depend on whether the selected features are independent. If they are, you can weight each feature by its mutual information (or some other measure of informativeness) with the response variable (whatever you are classifying on). If some features are dependent, then a single classification model will probably work best.
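For the independent case, here is a minimal sketch of one such weighting scheme: each single-feature kNN's predicted class probabilities are weighted by that feature's mutual information with the response. The helper function and the data are illustrative, not a standard API.

```python
# Combine single-feature kNN classifiers, weighted by mutual information.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier

def mi_weighted_knn(X_train, y_train, X_new, k=5):
    classes = np.unique(y_train)
    w = mutual_info_classif(X_train, y_train, random_state=0)
    w = w / w.sum() if w.sum() > 0 else np.full(len(w), 1.0 / len(w))
    proba = np.zeros((X_new.shape[0], len(classes)))
    for j in range(X_train.shape[1]):                 # one kNN per feature
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train[:, [j]], y_train)
        proba += w[j] * knn.predict_proba(X_new[:, [j]])
    return classes[proba.argmax(axis=1)]              # weighted vote

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(mi_weighted_knn(X[:150], y[:150], X[150:])[:10])
```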
How do most implementations apply kNN to more generalized learning?
By allowing the user to specify their own distance function (or a precomputed distance matrix between the points). kNN works well when an appropriate distance metric is used.
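For example, a minimal sketch of passing a user-defined distance to scikit-learn's kNN; the weighted Euclidean metric and the data here are illustrative assumptions:

```python
# kNN with a custom (callable) distance metric.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = (X[:, 0] > 0).astype(int)
feature_weights = np.array([2.0, 1.0, 0.25])          # hypothetical weights

def weighted_euclidean(a, b):
    return np.sqrt(np.sum(feature_weights * (a - b) ** 2))

knn = KNeighborsClassifier(n_neighbors=5, metric=weighted_euclidean)
knn.fit(X, y)
print(knn.predict(X[:5]))
```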
Your validation process and the reasoning behind it are entirely correct.
Using the same reasoning and model-building process: after you have selected $k$ by validation, you build the final model using all the training data $X$ and calculate the mean and variance based only on $X$, since these values are also part of the model.
Additionally: a classification model should classify new unlabeled instances independently of each other. But if you also calculate the mean and variance based on $Z$, then the prediction for the same unlabeled instance might change depending on what the rest of $Z$ looks like. This is not correct.
I guess the confusion originates from k-nearest-neighbor being a lazy learner, i.e. it stores all the instances instead of deriving a model with reduced complexity. Other learners do not store the instances, so calculating the normalization parameters across the whole combined set is not even possible there. See this related question without a specific learner: Perform feature normalization before or within model validation?
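To make the earlier point concrete, a minimal sketch in scikit-learn terms: the scaler is fitted on the labeled training data $X$ only, and its statistics are then reused unchanged for the new instances $Z$ (the data is synthetic and illustrative):

```python
# Fit normalization on X only; apply the same statistics to new data Z.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 2))    # training data
y = (X[:, 0] > 10).astype(int)
Z = rng.normal(loc=10.0, scale=3.0, size=(20, 2))     # new unlabeled instances

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)            # mean and variance are computed from X alone
print(model.predict(Z))    # Z is scaled with X's statistics, not its own
```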
I think that depends on the data. If you know your feature is bounded, you could scale it to $[0,1]$. If it's binary, I guess $\{0,1\}$ is a good choice, or perhaps $\{-1,1\}$. If it's unbounded, standardization to $z$-scores ($\bar{x} = 0$, $\sigma = 1$) is a reasonable choice.
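A minimal sketch of those three cases with scikit-learn preprocessing; the feature values are illustrative:

```python
# Bounded -> [0, 1]; binary -> {0, 1} or {-1, 1}; unbounded -> z-scores.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
bounded = rng.uniform(0, 255, size=(100, 1))      # e.g. pixel intensities
binary = rng.integers(0, 2, size=(100, 1))        # already in {0, 1}
unbounded = rng.normal(50, 10, size=(100, 1))     # no natural bounds

bounded_01 = MinMaxScaler().fit_transform(bounded)     # scaled to [0, 1]
binary_pm1 = 2 * binary - 1                            # mapped to {-1, 1}
z_scored = StandardScaler().fit_transform(unbounded)   # mean 0, sd 1
```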