Constant failure rate within each calculated proportion, independent failure events
If the failure rate is constant under a given set of conditions and the tests are independent, then the occurrence of failures follows a Bernoulli process.
In that case, you can construct intervals for the proportion of failures based on the binomial distribution.
The standard error of the proportion is $\sqrt{ p (1- p)/n}$, which is usually estimated as $\sqrt{\hat p (1-\hat p)/n}$, where $\hat p$ is the sample proportion.
With a large sample size (as long as the proportion isn't too small), you might as well use the normal approximation; for an interval with coverage $1-\alpha$, the interval for the population proportion $p$ is
$$\hat p\pm z_{1-\frac{\alpha}{2}}\sqrt{\hat p (1-\hat p)/n}$$
If $p$ is small (so small that $np$ is less than 20, say), you may be better off choosing one of the other intervals, such as the Wilson interval or the Clopper-Pearson interval.
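As a minimal sketch (assuming statsmodels is available and using hypothetical counts), the normal (Wald), Wilson and Clopper-Pearson intervals can be compared directly:

```python
# Compare binomial proportion intervals for a hypothetical 3 failures in 50 tests.
from statsmodels.stats.proportion import proportion_confint

failures, n = 3, 50

for method in ("normal", "wilson", "beta"):  # "beta" = Clopper-Pearson
    lo, hi = proportion_confint(failures, n, alpha=0.05, method=method)
    print(f"{method:>8s}: ({lo:.4f}, {hi:.4f})")
```

With counts this small, the Wald interval can dip below 0, which is one reason to prefer the other two here.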
In the situation where you have all-0, clearly $np$ is likely to be much less than 20 -- or even 5. Normal approximations cannot be used! A few of the other intervals do okay in this case (though some still need to be truncated to 0 on the left side), but a common approach* is the rule of three: with no failures observed in $n$ trials, the upper limit of an approximate 95% interval for $p$ is taken to be $3/n$. Similar comments apply for all-1 by interchanging success and failure.
*(assuming a 95% interval, though the approach adapts to other coverage probabilities)
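A minimal sketch of where the $3/n$ comes from, for a hypothetical sample size: the exact one-sided 95% upper limit with zero failures solves $(1-p)^n = 0.05$, i.e. $p = 1 - 0.05^{1/n} \approx 3/n$.

```python
# Rule of three vs the exact (Clopper-Pearson) one-sided upper limit, all-zero sample.
n = 120          # hypothetical number of tests, all passing
alpha = 0.05

rule_of_three = 3 / n
exact_upper = 1 - alpha ** (1 / n)   # solves (1 - p)**n == alpha

print(f"rule of three upper bound:          {rule_of_three:.5f}")
print(f"exact Clopper-Pearson upper bound:  {exact_upper:.5f}")
```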
If the underlying rate is not constant within a calculated proportion, or there is dependence, then the nature of the variation in rate or the dependence structure (respectively) would usually need to be characterized in some way.
If we can assume independence between trials, we can quantify whether there's over- or under-dispersion relative to the constant-$p$ assumption, and we can look for trends (a changing average over time) in binomial proportions, via (say) smoothing splines in logistic regression. If there are such trends, you need to be careful about what your hypotheses might be (you're not generally testing just a difference then).
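A minimal sketch of that kind of trend check, assuming statsmodels/patsy and simulated data, and using a regression B-spline (patsy's `bs()`) as a stand-in for a smoothing spline:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Hypothetical data: binary failure indicator y recorded at times t,
# with a slowly drifting underlying failure probability.
rng = np.random.default_rng(0)
t = np.arange(300)
y = rng.binomial(1, 0.10 + 0.05 * np.sin(t / 40))
df = pd.DataFrame({"y": y, "t": t})

# Logistic regression with a B-spline basis in time vs a constant-p model.
trend = smf.glm("y ~ bs(t, df=4)", data=df, family=sm.families.Binomial()).fit()
const = smf.glm("y ~ 1", data=df, family=sm.families.Binomial()).fit()

# Likelihood-ratio test of "trend over time" against "constant p".
lr = 2 * (trend.llf - const.llf)
p_value = chi2.sf(lr, trend.df_model - const.df_model)
print(f"LR statistic = {lr:.2f}, p-value = {p_value:.4f}")
```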
If you're prepared to assume a mixture distribution with a constant mixture of binomial proportions within each group being compared (which may be untenable), it might be reasonable to consider some kind of permutation approach (when testing) or bootstrapping approach (when constructing intervals).
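For the testing side, a minimal permutation-test sketch on hypothetical failure indicators for two groups (shuffling group labels under the null of no difference):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical failure indicators for the two groups being compared.
group_a = rng.binomial(1, 0.08, size=150)
group_b = rng.binomial(1, 0.15, size=130)

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

n_perm = 10_000
diffs = np.empty(n_perm)
for i in range(n_perm):
    rng.shuffle(pooled)                                  # relabel under the null
    diffs[i] = pooled[n_a:].mean() - pooled[:n_a].mean()

p_value = np.mean(np.abs(diffs) >= abs(observed))
print(f"observed difference: {observed:.4f}, permutation p-value: {p_value:.4f}")
```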
On the other hand, if we can assume constant $p$, we can check for serial dependence fairly readily.
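One readily implemented check is a runs test: under constant $p$ and independence, every ordering of the same 0s and 1s is equally likely, so the observed number of runs can be compared with its permutation distribution. A minimal sketch on hypothetical data:

```python
import numpy as np

def n_runs(x):
    """Number of runs (maximal blocks of identical values) in a 0/1 sequence."""
    return 1 + int(np.sum(x[1:] != x[:-1]))

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.1, size=300)   # hypothetical sequence of pass/fail results

observed = n_runs(x)

# Permute the sequence to get the null distribution of the run count.
perm = np.array([n_runs(rng.permutation(x)) for _ in range(5000)])
p_value = np.mean(np.abs(perm - perm.mean()) >= abs(observed - perm.mean()))
print(f"runs: {observed}, two-sided permutation p-value: {p_value:.3f}")
```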
It all comes down to distance in the feature space into which you're projecting your data. The $k$ in $k$NN tells the algorithm how many nearest neighbors, in terms of distance in your feature space, should be used to determine the class of an unknown data point, often by majority vote. Thinking of it in this way makes your observations more intuitive, I think: for your dataset, taking into account the class labels of the two most similar, or nearest, neighbors of a data point seems to be a good way to go. Taking into account only one is likely too few to overcome the variability of feature values within each class. Similarly, as you increase the number of data points contributing to the label of your new data point, you take into account points that are less and less similar to the point in question.

This is why the experiment you ran is so important for $k$NN! It is good practice to run cross-validation on a held-out optimization data set over multiple $k$ values, to estimate the best choice for your data. Data sets in different domains -- or even problems using data sets in the same domain but with different feature representations -- are going to differ.
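A minimal sketch of that kind of search (assuming scikit-learn, with a synthetic dataset standing in for your own):

```python
# Choose k for kNN by cross-validation over a grid of candidate values.
from sklearn.datasets import make_classification   # stand-in for your own data
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

for k in (1, 2, 3, 5, 9, 15, 25):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k:2d}: mean CV accuracy = {score:.3f}")
```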
Best Answer
The data turned out to be generated by some cosine-like function with a low density. This caused the nearest-neighbours classifier to perform well for small $k$ (it matches the cosine, because it takes into account the points that reach far into the "other-class zone") and for large $k$ (it matches an almost linear function, because it ignores ALL points that are on the "other side"). Everything in between confuses the classifier, because depending on which points are training and which are test data, the classification boundary changes a lot. This causes the variance and the bias to be higher for a moderate $k$ than for extremely low or high $k$.