Your explanation of Bayesian logistic regression looks pretty reasonable. A few remarks:
1. Using different priors for $a$ and $b$. Can I suggest that we use different variance hyperparameters $\sigma^2_a$ and $\sigma^2_b$ for $\mathbf a$ and $b$? Since $b$ is the intercept, you probably want to pick quite a flat prior for $b$, by picking a large value for $\sigma^2_b$.
It would also make notation simpler later on if we incorporate $b$ into $\mathbf a$, as an extra component. In other words, $\mathbf a_{\rm mine} = (\mathbf a_{\rm yours}, b) \in \mathbb R^{p+1}$. This is valid, provided that we also add an additional constant component to each of the feature vectors $\mathbf x_i$, i.e. $\mathbf x_{i, \rm mine} = (\mathbf x_{i, \rm yours}, 1) \in \mathbb R^{p + 1}$.
The prior variance for $\mathbf a$ is then a $(p+1)\times(p+1)$ diagonal matrix $\mathbf \Sigma$, whose first $p$ diagonal entries are $\sigma_a^2$ and whose final entry is $\sigma_b^2$.
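To make this concrete, here is a minimal NumPy sketch of that setup (the function names and the default variance values are just my own illustrative choices):

```python
import numpy as np

def augment(X):
    # Append a constant 1 to each feature vector, so that the intercept b
    # becomes the final component of the parameter vector a.
    return np.hstack([X, np.ones((X.shape[0], 1))])

def prior_cov(p, sigma2_a=1.0, sigma2_b=100.0):
    # Diagonal (p+1) x (p+1) prior covariance Sigma: sigma2_a for the p weights,
    # and a deliberately flat (large-variance) prior sigma2_b for the intercept.
    return np.diag(np.concatenate([np.full(p, sigma2_a), [sigma2_b]]))
```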
2. The $\mathbf x_i$'s don't need to be random variables. The presentation might be clearer if we abandon the probability distributions for the $\mathbf x_i$'s, as they are unimportant for the discussion. What is important is that the model is governed by the parameter vector $\mathbf a \in \mathbb R^{p+1}$, and a priori, this parameter vector is normally distributed:
$$ P(\mathbf a) = \mathcal N(\mathbf a | \mathbf 0, \mathbf \Sigma ). $$
Given the value of $\mathbf a$, and given the feature vectors $\mathbf x_i$, the labels $y_i$ are independent of each other, and follow Bernoulli distributions of the form:
$$ P(y_i | \mathbf x_i, \mathbf a) = \begin{cases} s(\mathbf a .\mathbf x_i) & {\rm if \ }y_i = 1 \\ 1 - s(\mathbf a . \mathbf x_i ) & {\rm if \ } y_i = 0\end{cases},$$
where $s(u) := e^u / (1 + e^u)$ is the logistic function.
The joint distribution of the labels and the parameter vector, given the feature vectors, is then
$$ P(y_1, \dots, y_N, \mathbf a | \mathbf x_1, \dots, \mathbf x_N ) = P(\mathbf a)\prod_{i=1}^N P(y_i | \mathbf x_i , \mathbf a).$$
And by the way, "joint probability distribution" is probably a more appropriate phrase than "PDF".
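If it helps, here is a short sketch of what sampling from this generative model looks like on synthetic data (the sizes and variance values are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

N, p = 200, 3
X = np.hstack([rng.normal(size=(N, p)), np.ones((N, 1))])  # augmented feature vectors
Sigma = np.diag([1.0] * p + [100.0])                        # prior covariance from point 1

a = rng.multivariate_normal(np.zeros(p + 1), Sigma)         # a ~ N(0, Sigma)
probs = 1.0 / (1.0 + np.exp(-(X @ a)))                      # s(a . x_i) for each i
y = rng.binomial(1, probs)                                  # y_i ~ Bernoulli(s(a . x_i))
```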
3. The expression for the log of the posterior distribution. Your expression for the log of the posterior distribution seems to assume that all the $y_i$ are positive! I think it should be:
\begin{multline} \log P(\mathbf a | y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) \\ = {\rm const} - \frac 1 2 \mathbf a^T \mathbf \Sigma^{-1} \mathbf a + \sum_{i}\left(y_i \log s (\mathbf a . \mathbf x_i) + (1 - y_i) \log (1 - s(\mathbf a . \mathbf x_i))\right)\end{multline}
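In code, this expression (up to the constant) is just the log prior plus the Bernoulli log likelihood. A minimal sketch, assuming `X` has the augmented feature vectors as its rows and `Sigma_inv` is the prior precision matrix $\mathbf\Sigma^{-1}$ (names are mine):

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def log_posterior(a, X, y, Sigma_inv):
    # log P(a | y, X) up to an additive constant:
    # -1/2 a^T Sigma^{-1} a + sum_i [y_i log s(a.x_i) + (1 - y_i) log(1 - s(a.x_i))]
    s = logistic(X @ a)
    return -0.5 * a @ Sigma_inv @ a \
           + np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))
```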
4. Information to extract from the model: going beyond point estimates. Given the training set $(\mathbf x_1, y_1), \dots, (\mathbf x_N, y_N)$, the posterior distribution for the parameters is
$$ P(\mathbf a | y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) \propto P(y_1, \dots, y_N | \mathbf x_1, \dots, \mathbf x_N, \mathbf a )\, P(\mathbf a), $$
by Bayes' theorem. So one of our objectives must surely be to understand this posterior distribution.
One perfectly valid question to ask is: what value of $\mathbf a$ maximises the posterior probability? In other words, the task is to find
$$ \mathbf a_{\rm MAP} = {\rm argmax}_{\mathbf a} P(\mathbf a | y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) .$$
This is what you discussed in your final paragraph: you take logs, ignore the terms that are independent of $\mathbf a$, and optimise the resulting log probability (in practice, minimise its negative) using gradient descent, Newton-Raphson, or some similar iterative optimisation technique. (The subscript ${\rm MAP}$ stands for maximum a posteriori, by the way.)
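For what it's worth, here is a sketch of how that optimisation might look, handing the negative log posterior and its gradient to scipy's BFGS routine (which stands in for the Newton-Raphson-style iteration; the function names are my own):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_post_and_grad(a, X, y, Sigma_inv):
    # Negative log posterior (up to a constant) and its gradient,
    # grad = Sigma^{-1} a - X^T (y - s(Xa)), returned together for scipy.
    s = 1.0 / (1.0 + np.exp(-(X @ a)))
    f = 0.5 * a @ Sigma_inv @ a - np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))
    g = Sigma_inv @ a - X.T @ (y - s)
    return f, g

def fit_map(X, y, Sigma_inv):
    res = minimize(neg_log_post_and_grad, np.zeros(X.shape[1]),
                   args=(X, y, Sigma_inv), jac=True, method="BFGS")
    return res.x  # a_MAP
```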
But this only gives us a point estimate for $\mathbf a$. We don't need the Bayesian approach to do this at all. In fact, if we follow the standard non-Bayesian approach with L2-regularisation, we arrive at exactly the same answer. So if this is all we're getting, then there is very little point in adopting the Bayesian approach! (Obviously that's a personal opinion - you may disagree!)
In my opinion, the real magic in the Bayesian approach is that we can get more than just a point estimate for $\mathbf a$! The distribution $P(\mathbf a| y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N)$ also tells us about the uncertainty in this estimate. For certain features, the posterior distribution for the corresponding component of $\mathbf a$ will be quite narrow, indicating that we are very sure about the coefficients that the model should assign to these features. For other features, it will be the other way round, indicating that we are less sure about what coefficient to assign them.
The posterior distribution $P(\mathbf a | y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) $ is quite a complicated function, however. To get to grips with it, a common technique (known as the Laplace approximation) is to Taylor expand the log of this distribution to quadratic order:
$$ \log P(\mathbf a | y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) \approx {\rm const } - \frac 1 2 (\mathbf a - \mathbf a_{\rm MAP})^T \mathbf H (\mathbf a - \mathbf a_{\rm MAP}),$$
where $\mathbf H$ is the negative of the Hessian matrix of $\log P(\mathbf a | y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N)$, evaluated at $\mathbf a_{\rm MAP}$. (The linear term vanishes because $\mathbf a_{\rm MAP}$ is a maximum, and $\mathbf H$ is positive definite there.)
Exponentiating this, we see that $\mathbf a$ is approximately normally distributed, with mean $\mathbf a_{\rm MAP}$ and covariance matrix $\mathbf H^{-1}$. Thus we have obtained not just a point estimate for $\mathbf a$, but also an estimate of the uncertainty.
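For logistic regression the required matrix has a simple closed form (as in Bishop's Section 4.5): $\mathbf H = \mathbf\Sigma^{-1} + \sum_i s_i(1 - s_i)\,\mathbf x_i \mathbf x_i^T$ with $s_i = s(\mathbf a_{\rm MAP}\cdot\mathbf x_i)$, so the Laplace approximation is cheap once you have $\mathbf a_{\rm MAP}$. A sketch (function name is mine):

```python
import numpy as np

def laplace_approximation(a_map, X, Sigma_inv):
    # H = Sigma^{-1} + sum_i s_i (1 - s_i) x_i x_i^T  (negative Hessian at a_MAP)
    s = 1.0 / (1.0 + np.exp(-(X @ a_map)))
    H = Sigma_inv + (X.T * (s * (1 - s))) @ X
    cov = np.linalg.inv(H)   # approximate posterior covariance H^{-1}
    return a_map, cov        # P(a | data) ~= N(a | a_MAP, H^{-1})
```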
So if I was explaining this to people, I would probably try to put the characterisation of the posterior distribution centre stage. It just so happens that computing $\mathbf a_{\rm MAP}$ is part of this, but I wouldn't make that the end goal. (Again, this is obviously a personal opinion...)
4a. Making Bayesian predictions.
Now, suppose we receive a new datapoint, with feature vector $\mathbf x_{\rm new}$. Our task is to predict the probability of the corresponding label $y_{\rm new}$ being $1$. According to the Bayesian approach, the prediction should be:
$$ P(y_{\rm new} = 1 | \mathbf x_{\rm new}, \mathbf x_1, \dots, \mathbf x_N, y_1, \dots, y_N) = \int d\mathbf a \ P(y_{\rm new} = 1 | \mathbf x_{\rm new}, \mathbf a)\, P(\mathbf a | y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N). $$
I think this part of the story - how to predict stuff (!!!) - is a really important thing to cover in your explanation of Bayesian logistic regression. If anything, I would probably mention this before any of the stuff about estimating the posterior distribution - in fact, I would use the desire to make predictions as the motivation for wanting to estimate the posterior distribution!
And I want to emphasise that we're not just using the point estimate $\mathbf a_{\rm MAP}$ to make the prediction. Instead, we're effectively pooling the predictions we get from all possible values for $\mathbf a$, but assigning higher weight to the predictions coming from choices of $\mathbf a$ that are deemed more likely given the training data. The use of a Bayesian approach, rather than just a simple frequentist/optimisation approach, is what makes this possible.
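This integral has no closed form for the logistic likelihood, but once you have the Laplace approximation it is easy to estimate, for example by Monte Carlo averaging over posterior samples as sketched below (there is also a probit-style approximation in Bishop's Section 4.5). The function name and sample count are my own choices.

```python
import numpy as np

def predict_proba(x_new, a_map, cov, n_samples=5000, seed=0):
    # Monte Carlo estimate of P(y_new = 1 | x_new, data):
    # draw a^(s) ~ N(a_MAP, H^{-1}) and average s(a^(s) . x_new).
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(a_map, cov, size=n_samples)
    return np.mean(1.0 / (1.0 + np.exp(-(samples @ x_new))))
```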
A lot of my notation is taken from Sections 4.3-4.5 of Bishop's "Pattern Recognition and Machine Learning", and the book is freely available online.
The likelihood in a parametric model is the joint density (or mass function) of the observed data conditional on any parameters in the model (which are treated as random in a Bayes context).
So using $f$ to denote a density or mass function, and letting $x$ denote the joint $x_i$ data, $y$ denote the joint $y_i$ data, and $\theta$ denote parameters, I would write the likelihood as
$$f(y,x|\theta)=f(y|x,\theta)f(x|\theta).$$
Note that if the density of $x$ doesn't depend on $\theta$, maximizing the likelihood over the parameter space is equivalent to maximizing $f(y|x,\theta)$, since the density of $x$ enters only as a multiplicative factor that does not involve $\theta$.
As far as using Bayes' rule, we have
$$f(\theta|y,x)=\frac{f(y,x|\theta)f(\theta)}{f(y,x)}=\frac{f(y|x,\theta)f(\theta|x)f(x)}{f(y,x)},$$
and often in a Bayesian context you will see this written up to proportionality, since the other terms are simply normalizing constants ($\theta$ being the random variable of interest):
$$f(\theta|y,x)\propto f(y,x|\theta)f(\theta)\propto f(y|x,\theta)f(\theta|x).$$
If you are confused at any point, just write out everything in terms of joint and marginal densities or masses to convince yourself of the correct identity. Hopefully that helps.
Best Answer
I don't have enough reputation to comment on your post to request clarification, as there are some quantities you haven't defined. However, I am inferring from the notational conventions you have adopted that you are operating in the area of statistics/machine learning. For the queries you have outlined, it's perhaps best to clarify the likelihood function in this setting, then MAP estimation, and then the predictive distribution, highlighting aspects of your query as I proceed.
Context.
I am going to follow the notational conventions of the book Pattern Recognition and Machine Learning by Christopher Bishop. It should be remarked that the notation used in this book explicitly supports Bayesian interpretations of statistics, rather than frequentist interpretations.
Consider a training dataset $\{\mathbf{x}_n, t_n \}$ for $n = 1, \dots, N$, consisting of $N$ IID input vectors $\mathbf{x}_n$ and $N$ IID scalar target variables $t_n$. We stack the $N$ input vectors into a matrix $\mathbf{X}$, and the scalar target variables into a vector $\mathbf{t}$. Denote by $\mathbf{w}$ the parameter vector of regression coefficients.
We assume that we can model the target variable $t_n$ with the deterministic function $y(\mathbf{x}_n, \mathbf{w})$ with additive Gaussian noise (i.e. $\epsilon \sim \mathcal{N}(0, \beta^{-1})$ where the precision/inverse variance is $\beta$):
$$t_n = y(\mathbf{x}_n, \mathbf{w}) + \epsilon = \mathbf{w}^T\mathbf{x}_n + \epsilon$$
Due to the assumption of additive Gaussian noise, the likelihood of a single $t_n$ is $p(t_n | \mathbf{x}_n, \mathbf{w}, \beta) = \mathcal{N}(t_n | y(\mathbf{x}_n, \mathbf{w}), \beta^{-1})$.
Because the scalar target variables $t_n$ are independent given $\mathbf{w}$ (they are multiple observations of a single target variable, rather than a single observation of a multivariate target vector), the likelihood function of $\mathbf{t}$ factorises as:
$$p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta) = \prod^N_{n=1} \mathcal{N}(t_n | \mathbf{w}^T \mathbf{x}_n, \beta^{-1})$$
Now we define a Gaussian prior on the parameter vector $\mathbf{w}$, so we have $p(\mathbf{w} | \alpha) = \mathcal{N}(\mathbf{w} | \mathbf{0}, \alpha^{-1}I)$.
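To fix ideas, here is a small synthetic-data sketch of this setup (the sizes and hyperparameter values are arbitrary choices of mine, not anything from your post):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)

N, D = 100, 3
alpha, beta = 2.0, 25.0                                  # prior precision, noise precision
X = rng.normal(size=(N, D))
w_true = rng.multivariate_normal(np.zeros(D), np.eye(D) / alpha)   # w ~ N(0, alpha^{-1} I)
t = X @ w_true + rng.normal(scale=beta ** -0.5, size=N)            # t_n = w^T x_n + eps

def log_likelihood(w, X, t, beta):
    # log p(t | X, w, beta) = sum_n log N(t_n | w^T x_n, beta^{-1})
    return norm.logpdf(t, loc=X @ w, scale=beta ** -0.5).sum()

def log_prior(w, alpha):
    # log p(w | alpha) = log N(w | 0, alpha^{-1} I)
    return multivariate_normal.logpdf(w, mean=np.zeros(len(w)), cov=np.eye(len(w)) / alpha)
```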
The likelihood function.
Now on the query concerning whether the likelihood should be $p(\mathcal{D} | \mathbf{w}) = p(\mathbf{t, X} | \mathbf{w})$ or whether it should be $p(\mathbf{t} | \mathbf{X}, \mathbf{w})$.
I am fairly certain that it is the latter rather than the former, given the context of linear regression in statistics/supervised learning in machine learning. That is, we are interested in modelling the distribution of the target variables $\mathbf{t}$ using the information contained in the regressors $\mathbf{X}$ (i.e. conditioning on these variables) and a suitable value of the parameter $\mathbf{w}$. As we have assumed that $\mathbf{t}$ is a function of $\mathbf{X}$ that is linear in the parameter $\mathbf{w}$, plus additive Gaussian noise, the likelihood function $p(\mathbf{t} | \mathbf{X}, \mathbf{w})$ arises naturally from this assumption. So to clarify, the likelihood $p(\mathbf{t} | \mathbf{X}, \mathbf{w})$ is a function of both the input matrix $\mathbf{X}$ and the parameter $\mathbf{w}$, and the latter remains to be estimated according to a procedure with desirable statistical properties.
Now I think the confusion possibly arises from the fact that you are using the notation $\mathcal{D} = \{\mathbf{t}, \mathbf{X} \}$ to refer to the "dataset", which is an acceptable form of notation, but it runs the risk of masking the distinction, crucial to the linear regression/supervised learning context, between the part of the dataset $\mathcal{D}$ you are seeking to model, i.e. $\mathbf{t}$, and the information you are conditioning on, the input variables/regressors $\mathbf{X}$.
Hence your specification of the posterior distribution in this setting should read:
$$p(\mathbf{w} | \mathbf{t}, \mathbf{X}, \alpha, \beta) \propto p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta) p(\mathbf{w} | \alpha)$$
And in Bishop you will find that he drops the $\alpha$ and $\beta$ for notational brevity, and also the input matrix $\mathbf{X}$. The latter is because in supervised learning/discriminative modelling in machine learning we are not interested in modelling the distribution of the input variables $\mathbf{X}$. This yields:
$$p(\mathbf{w} | \mathbf{t}) \propto p(\mathbf{t} | \mathbf{w}) p( \mathbf{w})$$
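For this particular model (Gaussian likelihood, Gaussian prior) the posterior is available in closed form, $p(\mathbf{w} | \mathbf{t}, \mathbf{X}, \alpha, \beta) = \mathcal{N}(\mathbf{w} | \mathbf{m}_N, \mathbf{S}_N)$ with $\mathbf{S}_N^{-1} = \alpha I + \beta \mathbf{X}^T\mathbf{X}$ and $\mathbf{m}_N = \beta \mathbf{S}_N \mathbf{X}^T \mathbf{t}$; this is the standard result in Bishop's Chapter 3, with $\mathbf{X}$ playing the role of the design matrix. A short sketch (function name is mine):

```python
import numpy as np

def posterior(X, t, alpha, beta):
    # Exact posterior p(w | t, X, alpha, beta) = N(w | m_N, S_N):
    #   S_N^{-1} = alpha I + beta X^T X,   m_N = beta S_N X^T t
    D = X.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(D) + beta * X.T @ X)
    m_N = beta * S_N @ X.T @ t
    return m_N, S_N
```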
MAP estimation.
You are correct concerning MAP estimation being frequentist. Even though we take the Bayesian route and assume that the parameter $\mathbf{w}$ is a random variable, and we compute a posterior distribution on $\mathbf{w}$, the MAP estimator is a point estimator. That is, we select the value of $\mathbf{w}$ that maximises the posterior $p(\mathbf{w} | \mathbf{t}, \mathbf{X}, \alpha, \beta)$:
$$\mathbf{w}_{MAP} = \underset{\mathbf{w}}{\text{argmax}} \space p(\mathbf{w} | \mathbf{t}, \mathbf{X}, \alpha, \beta) = \underset{\mathbf{w}}{\text{argmax}} \space p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta) p(\mathbf{w} | \alpha)$$
And yes you are correct concerning the distinction between MAP estimation and maximum likelihood estimation. In the latter case you maximise the likelihood $p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta)$. However, in doing MAP estimation, you are estimating the parameter $\mathbf{w}$ under a different model where you have added $L_2$/Tikhonov regularisation, known as ridge regression. See references at the bottom for further info.
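Concretely, for this Gaussian model the MAP estimate coincides with the ridge regression solution with regularisation strength $\lambda = \alpha/\beta$; a sketch (the function name is mine):

```python
import numpy as np

def w_map(X, t, alpha, beta):
    # w_MAP = argmax_w p(t | X, w, beta) p(w | alpha)
    #       = (lambda I + X^T X)^{-1} X^T t,  with lambda = alpha / beta
    lam = alpha / beta
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ t)
```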
Predictive distribution.
In order to understand what is going on here, you need to distinguish notationally between the test set and the training set. So, modifying your notation to denote a new input vector belonging to the test set as $\mathbf{x}_{0}$, we are interested in making a prediction of the corresponding target variable, which we denote $t_0$ (this is unknown).
Now via a marginalisation argument where we integrate out the parameter $\mathbf{w}$, the predictive distribution is:
$$\begin{align} p(t_0 | \mathbf{x}_0, \mathbf{t}, \mathbf{X}, \alpha, \beta) &= \int p(t_0, \mathbf{w} | \mathbf{t}, \mathbf{x}_0, \mathbf{X}, \alpha, \beta) d\mathbf{w} \\ &= \int p(t_0 | \mathbf{w}, \mathbf{t}, \mathbf{x}_0, \mathbf{X}, \beta) p(\mathbf{w} | \mathbf{x}_0, \mathbf{t}, \mathbf{X}, \alpha, \beta) d\mathbf{w} \\ &= \int \underbrace{p(t_0 | \mathbf{w}, \mathbf{x}_0, \beta)}_{\text{likelihood}} \underbrace{p(\mathbf{w} | \mathbf{t}, \mathbf{X}, \alpha, \beta)}_{\text{posterior}} d\mathbf{w} \end{align}$$
Similar to above and in Bishop, you can drop all the $\alpha$ and $\beta$ for notational brevity. In going from the 1st to the 2nd equality we just use the product rule of probability. In going from the 2nd to the 3rd equality we use conditional independence:
$$p(t_0 | \mathbf{w}, \mathbf{t}, \mathbf{x}_0, \mathbf{X}, \beta) = p(t_0 | \mathbf{w}, \mathbf{x}_0, \beta)$$
$$p(\mathbf{w} | \mathbf{x}_0, \mathbf{t}, \mathbf{X}, \alpha, \beta) = p(\mathbf{w} | \mathbf{t}, \mathbf{X}, \alpha, \beta)$$
If you aren't comfortable with why we can invoke conditional independence just say in the comments and I will reply, as this post is getting too long for my liking. Notice that the two terms on the right above are just the likelihood and posterior we computed previously.
Concerning your query on what the predictive distribution $p(t_0 | \mathbf{x}_0, \mathbf{t}, \mathbf{X}, \alpha, \beta)$ is supposed to represent.
It is just the integral of the product of the likelihood and posterior, i.e. that product with the parameter $\mathbf{w}$ integrated out. Intuitively, you evaluate the likelihood of a value $t_0$ given $\mathbf{x}_0$ for a particular $\mathbf{w}$, weight that likelihood by your current belief about the parameter $\mathbf{w}$ given the data $\mathcal{D} = \{\mathbf{t}, \mathbf{X} \}$, and then integrate over all possible values $\mathbf{w}$ can take.
This is not a fully Bayesian approach, but it is more Bayesian than, say, attempting to predict $t_0$ using the MAP point estimator $\mathbf{w}_{\text{MAP}}$ and computing $t_0 \approx \mathbf{w}_{\text{MAP}}^T\mathbf{x}_0$.
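For completeness, in this Gaussian case the predictive distribution also has a closed form, $p(t_0 | \mathbf{x}_0, \mathbf{t}, \mathbf{X}, \alpha, \beta) = \mathcal{N}(t_0 \,|\, \mathbf{m}_N^T \mathbf{x}_0, \ \beta^{-1} + \mathbf{x}_0^T \mathbf{S}_N \mathbf{x}_0)$, where $\mathbf{m}_N, \mathbf{S}_N$ are the posterior mean and covariance (again a Chapter 3 result in Bishop). A sketch, to contrast with the plug-in MAP prediction (function name is mine):

```python
import numpy as np

def predictive(x0, m_N, S_N, beta):
    # p(t_0 | x_0, t, X, alpha, beta) = N(t_0 | m_N^T x_0, 1/beta + x_0^T S_N x_0)
    mean = m_N @ x0
    var = 1.0 / beta + x0 @ S_N @ x0
    return mean, var

# The plug-in prediction t_0 ~= w_MAP^T x_0 keeps only the mean (here m_N = w_MAP)
# and discards the parameter uncertainty captured by S_N.
```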
References
For appropriate references on Bayesian linear regression, look at Chapter 3 of "Pattern Recognition and Machine Learning" by Christopher Bishop (from which I have used notation, although some of the arguments I've made above aren't fully explained there). There is a good table on how Bayesian one can get on p. 175 of "Machine Learning: A Probabilistic Perspective" by Kevin Murphy. There is also a very clear exposition by John Paisley, which is excellent for the link between MAP estimation, ridge regression, and Bayesian linear regression, here - see lecture 5 in particular. It's free to enroll, so if you sign up you can download all the course materials.