This part relates primarily to your first, third, and fourth questions:
There's a fundamental difference between Bayesian statistics and frequentist statistics.
Frequentist statistics makes inferences about fixed but unknown parameter values from data regarded as random, usually via the likelihood, $P(\underline{x}|\theta)$. You take $\theta$ (some parameter or parameters) as fixed but unknown and ask which values make the observed data more likely; that is, it uses the sampling properties of the data under some model, given the parameters, to infer where the parameters might be. (A Bayesian might say the frequentist approach is based on 'the frequencies of things that didn't happen'.)
Bayesian statistics represents the information about the parameters as a probability distribution over them, which is updated by the data via the likelihood. Parameters have distributions, so you look at $P(\theta|\underline{x})$.
This results in quantities that often look similar, but where the conditioning in one looks "the wrong way around" when viewed through the lens of the other way of thinking about it.

So, fundamentally, they're somewhat different things, and the fact that the quantity on the left of the conditioning bar in one ($P(\theta|\underline{x})$) sits on the right in the other ($P(\underline{x}|\theta)$) is no accident.
If you do some work with both, it soon becomes reasonably clear.
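If a toy example helps make the contrast concrete, here is a minimal sketch (entirely my own, with made-up counts and a flat Beta(1, 1) prior; it is not taken from any of your questions):

```python
# Toy binomial example (assumed numbers): k = 7 successes in n = 10 trials.
from scipy import stats, optimize

n, k = 10, 7

# Frequentist view: theta is fixed but unknown; ask which value makes the
# observed data most likely, i.e. maximise the likelihood P(x | theta) over theta.
def neg_log_lik(theta):
    return -stats.binom.logpmf(k, n, theta)

mle = optimize.minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded").x
print("MLE of theta:", mle)   # about 0.7 = k/n

# Bayesian view: theta has a distribution, updated by the data via the likelihood.
# With a flat Beta(1, 1) prior, P(theta | x) is Beta(k + 1, n - k + 1).
posterior = stats.beta(k + 1, n - k + 1)
print("Posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```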
The second question seems to me to relate simply to a typo.
---
the statement "equivalent to the usual frequentist sampling distribution, that is" : I took this to mean that the authors were stating the frequentist sampling distribution. Have I read this wrongly?
There are two things going on there: they've expressed something a bit loosely (people use this particular kind of over-loose expression all the time), and I think you're also interpreting it differently from the intent.
What exactly does the expression they give mean, then?
Hopefully the discussion below will help clarify the intended sense.
If you can provide a reference (pref. online as I don't have good library access) where this expression is derived I would be grateful.
It follows right from here:
http://en.wikipedia.org/wiki/Bayesian_linear_regression
by taking a flat prior on $\beta$ and, I think, a flat prior on $\log\sigma^2$ (i.e. $p(\sigma^2)\propto 1/\sigma^2$) as well.
The reason is that the posterior for $\beta$ is then proportional to the likelihood, and the intervals generated from the posteriors on the parameters match the frequentist confidence intervals for the parameters.
You might find the first few pages here helpful as well.
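In case a numerical illustration of that equivalence is useful, here is a minimal sketch (my own, on simulated data, under the improper prior $p(\beta,\sigma^2)\propto 1/\sigma^2$); the posterior interval for the slope comes out essentially identical to the usual confidence interval:

```python
# Sketch only: simulated data, improper prior proportional to 1/sigma^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 50, 2                                   # n observations, k coefficients (intercept, slope)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=1.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
rss = resid @ resid
s2 = rss / (n - k)                              # usual unbiased estimate of sigma^2
V = np.linalg.inv(X.T @ X)
se = np.sqrt(s2 * np.diag(V))

# Frequentist 95% confidence interval for the slope
tcrit = stats.t.ppf(0.975, df=n - k)
print("Confidence interval:", beta_hat[1] - tcrit * se[1], beta_hat[1] + tcrit * se[1])

# Bayesian posterior draws under the improper prior:
# sigma^2 | y is rss divided by a chi-square(n - k) draw,
# and beta | sigma^2, y ~ N(beta_hat, sigma^2 (X^T X)^{-1}); we track the slope.
S = 200_000
sigma2 = rss / rng.chisquare(n - k, size=S)
slope = beta_hat[1] + rng.normal(size=S) * np.sqrt(sigma2 * V[1, 1])
print("Credible interval:", np.percentile(slope, [2.5, 97.5]))   # essentially the same
```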
Best Answer
We can prove this for the more general case of $p$ variables by using the "hat matrix" and some of its useful properties. These results are usually much harder to state without matrix notation because of the use of the spectral decomposition.
Now, in the matrix version of least squares, the hat matrix is $H=X(X^TX)^{-1}X^T$, where $X$ has $n$ rows and $p+1$ columns (a column of ones for $\beta_0$). Assume full column rank for convenience; otherwise you can replace $p+1$ by the column rank of $X$ in what follows. We can write the fitted values as $\hat{Y}_i=\sum_{j=1}^nH_{ij}Y_j$, or in matrix notation $\hat{Y}=HY$. Using this, we can write the sum of squares as:
$$\frac{\sum_{i=1}^n(Y_i-\hat{Y}_i)^2}{\sigma^2}=\frac{(Y-\hat{Y})^T(Y-\hat{Y})}{\sigma^2}=\frac{(Y-HY)^T(Y-HY)}{\sigma^2}$$ $$=\frac{Y^T(I_n-H)Y}{\sigma^2}$$
where $I_n$ is the identity matrix of order $n$. The last step follows from the fact that $H$ is symmetric and idempotent, as $$H^2=[X(X^TX)^{-1}X^T][X(X^TX)^{-1}X^T]=X(X^TX)^{-1}X^T=H,\qquad H^T=H,$$ so that $(I_n-H)^T(I_n-H)=I_n-H^T-H+H^TH=I_n-H$.
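As a quick numerical sanity check of these properties of $H$ (just an illustrative sketch with an arbitrary simulated design matrix, not part of the proof):

```python
# Check that H is symmetric, idempotent, reproduces the fitted values,
# and satisfies H X = X (a property used further below).
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # n rows, p + 1 columns
Y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(H, H.T))               # True: symmetric
print(np.allclose(H @ H, H))             # True: idempotent
print(np.allclose(H @ Y, X @ beta_hat))  # True: H Y gives the fitted values
print(np.allclose(H @ X, X))             # True: H X = X
```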
Now, a neat property of idempotent matrices is that all of their eigenvalues must be equal to zero or one. Letting $e$ denote a normalised eigenvector of $H$ with eigenvalue $l$, we can prove this as follows:
$$He=le\implies H(He)=H(le)$$ $$LHS=H^2e=He=le\;\;\; RHS=lHe=l^2e$$ $$\implies le=l^2e\implies l=0\text{ or }1$$
(Note that $e$ cannot be zero, as it must satisfy $e^Te=1$.) Now, because $H$ is idempotent, $I_n-H$ is too, because
$$(I_n-H)(I_n-H)=I_n-I_nH-HI_n+H^2=I_n-H$$
We also have the property that the sum of the eigenvalues equals the trace of the matrix, and $$tr(I_n-H)=tr(I_n)-tr(H)=n-tr(X(X^TX)^{-1}X^T)=n-tr((X^TX)^{-1}X^TX)$$ $$=n-tr(I_{p+1})=n-p-1$$
Hence $I-H$ must have $n-p-1$ eigenvalues equal to $1$ and $p+1$ eigenvalues equal to $0$.
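A quick numerical check of this eigenvalue count (again with a simulated design matrix of my own, $n=30$, $p=3$):

```python
# Count the eigenvalues of I - H for a simulated design with n = 30, p = 3.
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

eigvals = np.linalg.eigvalsh(np.eye(n) - H)   # I - H is symmetric, so eigvalsh applies
print(np.sum(np.isclose(eigvals, 1.0)))       # n - p - 1 = 26
print(np.sum(np.isclose(eigvals, 0.0)))       # p + 1 = 4
print(round(np.trace(np.eye(n) - H)))         # 26, matching the trace argument
```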
Now we can use the spectral decomposition $I-H=ADA^T$, where $D=\begin{pmatrix}I_{n-p-1} & 0_{[n-p-1]\times[p+1]}\\0_{[p+1]\times [n-p-1]} & 0_{[p+1]\times [p+1]}\end{pmatrix}$ and $A$ is orthogonal (because $I-H$ is symmetric). A further useful property is that $HX=X$ (since $HX=X(X^TX)^{-1}X^TX=X$). This helps narrow down the matrix $A$:
$$HX=X\implies(I-H)X=0\implies ADA^TX=0\implies DA^TX=0$$ (premultiplying by $A^T$ and using $A^TA=I_n$) $$\implies (A^TX)_{ij}=0\;\;\;i=1,\dots,n-p-1,\;\;\; j=1,\dots,p+1$$
and we get:
$$\frac{\sum_{i=1}^n(Y_i-\hat{Y}_i)^2}{\sigma^2}=\frac{Y^TADA^TY}{\sigma^2}=\frac{(A^TY)^TD(A^TY)}{\sigma^2}=\frac{\sum_{i=1}^{n-p-1}(A^TY)_i^2}{\sigma^2}$$
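This identity is easy to check numerically with an explicit spectral decomposition (again just an illustrative sketch with a simulated design and response):

```python
# Check Y^T (I - H) Y against the sum of squares of the components of A^T Y
# that correspond to eigenvalue 1 of I - H.
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = rng.normal(size=n)
H = X @ np.linalg.inv(X.T @ X) @ X.T

eigvals, A = np.linalg.eigh(np.eye(n) - H)    # columns of A are orthonormal eigenvectors
keep = np.isclose(eigvals, 1.0)               # the n - p - 1 directions with eigenvalue 1
z = A.T @ Y
print(Y @ (np.eye(n) - H) @ Y)                # quadratic form
print(np.sum(z[keep] ** 2))                   # same value
```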
Now, under the model we have $Y\sim N(X\beta,\sigma^2I_n)$, and by standard normal theory $A^TY\sim N(A^TX\beta,\sigma^2A^TA)=N(A^TX\beta,\sigma^2I_n)$, showing that the components of $A^TY$ are independent. Now, using the result above that the first $n-p-1$ rows of $A^TX$ are zero, we have $(A^TY)_i\sim N(0,\sigma^2)$ for $i=1,\dots,n-p-1$. The chi-squared distribution with $n-p-1$ degrees of freedom for the sum of squared errors follows immediately, because the sum of $n-p-1$ independent squared $N(0,\sigma^2)$ variables, divided by $\sigma^2$, is $\chi^2_{n-p-1}$ by definition.
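Finally, a small Monte Carlo check of the conclusion (illustrative only, with an arbitrary fixed design and made-up true parameters): the scaled sum of squared residuals should have mean $n-p-1$ and variance $2(n-p-1)$, as a $\chi^2_{n-p-1}$ variable does.

```python
# Simulate Y ~ N(X beta, sigma^2 I) repeatedly and look at RSS / sigma^2.
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 30, 3, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = rng.normal(size=p + 1)
H = X @ np.linalg.inv(X.T @ X) @ X.T

draws = []
for _ in range(20_000):
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    resid = Y - H @ Y
    draws.append(resid @ resid / sigma**2)
draws = np.array(draws)

df = n - p - 1
print(draws.mean(), df)       # close to df = 26
print(draws.var(), 2 * df)    # close to 2 * df = 52
```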