In a scenario with $N$ samples and $K$ classes or labels, The first formula should be
$$\frac{1}{N-K} \sum_{c=1}^K \sum_{y_i = c} (x_i - \hat \mu_c) (x_i - \hat \mu_c)^\intercal$$
and is for calculating the pooled variance, to be used if you're tying the covariance matrix across classes (as in LDA). The $N-K$ term arises from Bessel's correction.
If you're not tying the covariance matrices (as in QDA), then the covariance matrix for a class $c$ with $N_c$ samples is
$$\frac{1}{N_c - 1} \sum_{y_i = c} (x_i - \hat \mu_c) (x_i - \hat \mu_c)^\intercal$$
if you want an unbiased estimate of the variance, or
$$\frac{1}{N_c} \sum_{y_i = c} (x_i - \hat \mu_c) (x_i - \hat \mu_c)^\intercal$$
if you want an MSE estimate of the variance.
Either way, usually you don't calculate the equation of the decision boundary in QDA. Given a test point you just evaluate the posterior probability of each class, and pick the highest.
I'm addressing only to one aspect of the question, and doing it intuitively without algebra.
If the $g$ classes have the same variance-covariance matrices and differ only by the shift of their centroids in the $p$-dimensional space then they are completely linearly separable in the $q=min(g-1,p)$ "subspace". This is what LDA is doing. Imagine you have three identical ellipsoids in the space of variables $V_1, V_2, V_3$. You have to use the information from all the variables in order to predict the class membership without error. But due to the fact that these were identically sized and oriented clouds it is possible to rescale them by a common transform into balls of unit radius. Then $q=g-1=2$ independent dimensions will suffice to predict the class membership as precisely as formerly. These dimensions are called discriminant functions $D_1, D_2$. Having 3 same-size balls of points you need only 2 axial lines and to know the balls' centres coordinates onto them in order to assign every point correctly.
Discriminants are uncorrelated variables, their within-class covariance matrices are ideally identity ones (the balls). Discriminants form a subspace of the original variables space - they are their linear combinations. However, they are not rotation-like (PCA-like) axes: seen in the original variables space, discriminants as axes are not mutually orthogonal.
So, under the assumption of homogeneity of within-class variance-covariances LDA using for classification all the existing discriminants is no worse than classifying immediately by the original variables. But you don't have to use all the discriminants. You might use only $m<q$ first most strong / statistically significant of them. This way you lose minimal information for classifying and the missclassification will be minimal. Seen from this perspective, LDA is a data reduction similar to PCA, only supervised.
Note that assuming the homogeneity (+ multivariate normality) and provided that you plan to use but all the discriminants in classification it is possible to bypass the extraction of the discriminants themselves - which involves generalized eigenproblem - and compute the so called "Fisher's classification functions" from the variables directly, in order to classify with them, with the equivalent result. So, when the $g$ classes are identical in shape we could consider the $p$ input variables or the $g$ Fisher's functions or the $q$ discriminants as all equivalent sets of "classifiers". But discriminants are more convenient in many respect.$^1$
Since usually the classes are not "identical ellipses" in reality, the classification by the $q$ discriminants is somewhat poorer than if you do Bayes classification by all the $p$ original variables. For example, on this plot the two ellipsoids are not parallel to each other; and one can visually grasp that the single existing discriminant is not enough to classify points as accurately as the two variables allow to. QDA (quadratic discriminant analysis) would be then a step better approximation than LDA. A practical approach half-way between LDA and QDA is to use LDA-discriminants but use their observed separate-class covariance matrices at classification (see,see) instead of their pooled matrix (which is the identity).
(And yes, LDA can be seen as closely related to, even a specific case of, MANOVA and Canonical correlation analysis or Reduced rank multivariate regression - see, see, see.)
$^1$ An important terminological note. In some texts the $g$ Fisher's classification functions may be called "Fisher's discriminant functions", which may confuse with the $q$ discriminats which are canonical discriminant functions (i.e. obtained in the eigendecomposition of $\bf W^{-1}B$). For clarity, I recommend to say "Fisher's classification functions" vs "canonical discriminant functions" (= discriminants, for short). In modern understanding, LDA is the canonical linear discriminant analysis. "Fisher's discriminant analysis" is, at least to my awareness, either LDA with 2 classes (where the single canonical discriminant is inevitably the same thing as the Fisher's classification functions) or, broadly, the computation of Fisher's classification functions in multiclass settings.
Best Answer
My answer would be no you cannot use the usual form of LDA or QDA if your data points are binary. The reason for this is the following:
In his lecture notes, Andrew Ng specifically states that the assumption behind LDA is that the X data is continuous
See here: