Standard ANOVA estimates the common variance from the data. It uses a "pooled" estimate by taking the differences between each data point and the group mean for that data point, squaring those differences, summing, then dividing by the degrees of freedom (which is the number of data points minus the number of groups, for simple one-way ANOVA).
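For concreteness, here is a minimal Python sketch of this pooled estimate on made-up data (the groups and values are purely illustrative):

```python
import numpy as np

# Hypothetical data: three groups of unequal size
groups = [np.array([4.1, 5.2, 6.0]),
          np.array([5.5, 6.1, 7.2, 6.8]),
          np.array([3.9, 4.4, 5.0])]

n = sum(len(g) for g in groups)   # total number of data points
k = len(groups)                   # number of groups

# Squared deviations of each point from its own group mean, summed over all groups
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df = n - k                        # data points minus groups
pooled_var = ss_within / df       # the pooled variance estimate
print(pooled_var)
```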
It would be possible to derive an equivalent of ANOVA where the common variance is known (probably based on the $\chi^2$ distribution rather than the $F$), but the likelihood of encountering a real-world situation where the variance is known but the means are not is low enough that most people don't worry about that case.
In practice it is probably never exactly true that the populations are normal or that the variances are equal. The Central Limit Theorem covers the normality assumption when the data are close enough to normal and the sample sizes are large enough. The ANOVA tests have also been shown to be fairly robust to violations of the equal-variance assumption as long as the variances are at least similar (a common rule of thumb is that ANOVA is fine as long as the ratio of the largest to the smallest variance is less than 4; a quick check is sketched below).
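As a rough illustration of that rule of thumb (reusing the hypothetical groups from above, and keeping in mind that the 4:1 cutoff is only a heuristic):

```python
import numpy as np

groups = [np.array([4.1, 5.2, 6.0]),
          np.array([5.5, 6.1, 7.2, 6.8]),
          np.array([3.9, 4.4, 5.0])]

variances = [g.var(ddof=1) for g in groups]  # sample variances
ratio = max(variances) / min(variances)
print(ratio < 4)  # True: the rule of thumb is satisfied
```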
The general linear model lets us write an ANOVA model as a regression model. Let's assume we have two groups with two observations each, i.e., four observations in a vector $y$. Then the original, overparametrized model is $E(y) = X^{\star} \beta^{\star}$, where $X^{\star}$ is the matrix of predictors, i.e., dummy-coded indicator variables:
$$
\left(\begin{array}{c}\mu_{1} \\ \mu_{1} \\ \mu_{2} \\ \mu_{2}\end{array}\right) = \left(\begin{array}{ccc}1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1\end{array}\right) \left(\begin{array}{c}\beta_{0}^{\star} \\ \beta_{1}^{\star} \\ \beta_{2}^{\star}\end{array}\right)
$$
The parameters are not identifiable as $((X^{\star})' X^{\star})^{-1} (X^{\star})' E(y)$ because $X^{\star}$ has rank 2 ($(X^{\star})'X^{\star}$ is not invertible). To change that, we introduce the constraint $\beta_{1}^{\star} = 0$ (treatment contrasts), which gives us the new model $E(y) = X \beta$:
$$
\left(\begin{array}{c}\mu_{1} \\ \mu_{1} \\ \mu_{2} \\ \mu_{2}\end{array}\right) = \left(\begin{array}{cc}1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 1\end{array}\right) \left(\begin{array}{c}\beta_{0} \\ \beta_{2}\end{array}\right)
$$
So $\mu_{1} = \beta_{0}$, i.e., $\beta_{0}$ takes on the meaning of the expected value in our reference category (group 1). $\mu_{2} = \beta_{0} + \beta_{2}$, i.e., $\beta_{2}$ takes on the meaning of the difference $\mu_{2} - \mu_{1}$ from the reference category. Since with two groups there is just one parameter associated with the group effect, the ANOVA null hypothesis (all group effect parameters are 0) is the same as the regression null hypothesis (the slope parameter is 0).
A $t$-test in the general linear model tests a linear combination $\psi = \sum c_{j} \beta_{j}$ of the parameters against a hypothesized value $\psi_{0}$ under the null hypothesis. Choosing $c = (0, 1)'$, we can thus test the hypothesis that $\beta_{2} = 0$ (the usual test for the slope parameter), i.e., here, that $\mu_{2} - \mu_{1} = 0$. The estimator is $\hat{\psi} = \sum c_{j} \hat{\beta}_{j}$, where $\hat{\beta} = (X'X)^{-1} X' y$ are the OLS estimates of the parameters. The general test statistic for such a $\psi$ is:
$$
t = \frac{\hat{\psi} - \psi_{0}}{\hat{\sigma} \sqrt{c' (X'X)^{-1} c}}
$$
$\hat{\sigma}^{2} = \|e\|^{2} / (n-\mathrm{Rank}(X))$ is an unbiased estimator for the error variance, where $\|e\|^{2}$ is the sum of the squared residuals. In the case of two groups $\mathrm{Rank}(X) = 2$, $(X'X)^{-1} X' = \left(\begin{smallmatrix}.5 & .5 & 0 & 0 \\-.5 & -.5 & .5 & .5\end{smallmatrix}\right)$, and the estimators thus are $\hat{\beta}_{0} = 0.5 y_{1} + 0.5 y_{2} = M_{1}$ and $\hat{\beta}_{2} = -0.5 y_{1} - 0.5 y_{2} + 0.5 y_{3} + 0.5 y_{4} = M_{2} - M_{1}$. With $c' (X'X)^{-1} c$ being 1 in our case, the test statistic becomes:
$$
t = \frac{M_{2} - M_{1} - 0}{\hat{\sigma}} = \frac{M_{2} - M_{1}}{\sqrt{\|e\|^{2} / (n-2)}}
$$
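These computations are easy to verify numerically. A small sketch with numpy, using four made-up observations $y$ (two per group):

```python
import numpy as np

X = np.array([[1., 0.],
              [1., 0.],
              [1., 1.],
              [1., 1.]])
y = np.array([4.1, 5.3, 6.2, 7.0])  # hypothetical data: group 1 = (y1, y2), group 2 = (y3, y4)

XtX_inv = np.linalg.inv(X.T @ X)
print(XtX_inv @ X.T)                 # [[ .5  .5  0   0 ], [-.5 -.5  .5  .5]]

beta_hat = XtX_inv @ X.T @ y         # OLS estimates
M1, M2 = y[:2].mean(), y[2:].mean()
print(beta_hat, M1, M2 - M1)         # beta0_hat = M1, beta2_hat = M2 - M1

e = y - X @ beta_hat                 # residuals
sigma_hat = np.sqrt(e @ e / (len(y) - 2))
c = np.array([0., 1.])
t = (c @ beta_hat) / (sigma_hat * np.sqrt(c @ XtX_inv @ c))
print(t)                             # the t statistic from the formula above
```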
$t$ is $t$-distributed with $n - \mathrm{Rank}(X)$ df (here $n-2$). When you square $t$, you get $\frac{(M_{2} - M_{1})^{2} / 1}{\|e\|^{2} / (n-2)} = \frac{SS_{b} / df_{b}}{SS_{w} / df_{w}} = F$, the test statistic from the ANOVA $F$-test for two groups ($b$ for between, $w$ for within groups) which follows an $F$-distribution with 1 and $n - \mathrm{Rank}(X)$ df.
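The equivalence $t^2 = F$ is also easy to confirm empirically, e.g. with scipy on simulated two-group data (the sample sizes and effect size below are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(0.0, 1.0, size=10)
g2 = rng.normal(0.5, 1.0, size=12)

t, p_t = stats.ttest_ind(g1, g2, equal_var=True)  # pooled-variance t-test
F, p_F = stats.f_oneway(g1, g2)                   # one-way ANOVA F-test

print(t**2, F)   # equal up to floating-point error
print(p_t, p_F)  # identical p-values
```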
With more than two groups, the ANOVA hypothesis (all $\beta_{j}$ with $j \geq 1$ are simultaneously 0) refers to more than one parameter and cannot be expressed as a single linear combination $\psi$, so the two tests are then no longer equivalent.
Best Answer
Consider the following setup. We have a $p$-dimensional parameter vector $\theta$ that specifies the model completely and a maximum-likelihood estimator $\hat{\theta}$. The Fisher information in $\theta$ is denoted $I(\theta)$. What is usually referred to as the Wald statistic is
$$(\hat{\theta} - \theta)^T I(\hat{\theta}) (\hat{\theta} - \theta)$$
where $I(\hat{\theta})$ is the Fisher information evaluated at the maximum-likelihood estimator. Under regularity conditions the Wald statistic asymptotically follows a $\chi^2$-distribution with $p$ degrees of freedom when $\theta$ is the true parameter. The Wald statistic can be used to test a simple hypothesis $H_0 : \theta = \theta_0$ on the entire parameter vector.
With $\Sigma(\theta) = I(\theta)^{-1}$ the inverse Fisher information, the Wald test statistic of the hypothesis $H_0 : \theta_1 = \theta_{0,1}$ on the first coordinate is $$\frac{(\hat{\theta}_1 - \theta_{0,1})^2}{\Sigma(\hat{\theta})_{11}}.$$ Its asymptotic distribution is a $\chi^2$-distribution with 1 degree of freedom.
For the normal model, where $\theta = (\mu, \sigma^2)$ is the vector of the mean and the variance parameters, the Wald test statistic for testing whether $\mu = \mu_0$ is $$\frac{n(\hat{\mu} - \mu_0)^2}{\hat{\sigma}^2}$$ with $n$ the sample size. Here $\hat{\sigma}^2$ is the maximum-likelihood estimator of $\sigma^2$ (where you divide by $n$). The $t$-test statistic is $$\frac{\sqrt{n}(\hat{\mu} - \mu_0)}{s}$$ where $s^2$ is the unbiased estimator of the variance (where you divide by $n-1$). The Wald test statistic is almost but not exactly equal to the square of the $t$-test statistic; they are asymptotically equivalent as $n \to \infty$. The squared $t$-test statistic has an exact $F(1, n-1)$-distribution, which converges to the $\chi^2$-distribution with 1 degree of freedom as $n \to \infty$.
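A minimal numerical sketch of this relationship, assuming simulated normal data and an arbitrary $\mu_0$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.3, 1.0, size=25)
n, mu0 = len(x), 0.0

mu_hat = x.mean()
ss = ((x - mu_hat) ** 2).sum()
sigma2_ml = ss / n        # ML estimator of the variance (divide by n)
s2 = ss / (n - 1)         # unbiased estimator (divide by n - 1)

wald = n * (mu_hat - mu0) ** 2 / sigma2_ml  # Wald statistic
t2 = n * (mu_hat - mu0) ** 2 / s2           # squared t statistic
print(wald, t2, wald / t2)                  # ratio is exactly n / (n - 1)
```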
The same story holds for the $F$-test in one-way ANOVA.