Excellent questions.
Regarding A:
A sufficient statistic is nothing more than a distillation of the information that is contained in the sample with respect to a given model. As you would expect, if you have an independent sample $x_i \sim N(\mu,\sigma^2)$ for $i \in \{1, \ldots, N\}$, then once we have calculated the sample mean and sample variance, it doesn't matter what the individual values of the $x_i$ are. In linear regression (easier to talk about than logistic in this context), the sampling distribution of the unknown coefficient vector (for known variance) is $N\big((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y},\; \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\big)$, so as long as these final quantities are identical, inference based thereupon will be too. This is the idea of sufficiency.
Note that in the $N(\mu,\sigma^2)$ example, the sufficient statistic consists of just two numbers: $\hat{\mu}=\frac{1}{N}\sum_{i=1}^N x_i$ and $\frac{1}{N}\sum_{i=1}^N (x_i-\hat{\mu})^2$, no matter how big our sample size $N$ is (and assuming $N>2$). Likewise, the vector $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ is of dimension $P$ and $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ of dimension $P\times P$ (here $P$ is the number of columns of the design matrix), both of which are independent of $N$ (though, technically, the matrix $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ is just a constant under our assumptions). So in these examples, the sufficient statistic has a fixed number of values (not fixed values), or as I would put it, fixed dimension.
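A quick numerical check of this can help. The sketch below (function names are mine, chosen for illustration) evaluates the normal log-likelihood twice: once from the full sample, and once from only $N$, the sample mean and the $\frac{1}{N}$-normalized sample variance. It relies on the identity $\sum_i (x_i-\mu)^2 = N\hat{\sigma}^2 + N(\hat{\mu}-\mu)^2$.

```python
import math

def normal_loglik_full(xs, mu, sigma2):
    """Log-likelihood of an i.i.d. N(mu, sigma2) sample, using every data point."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

def normal_loglik_suff(n, mean, var, mu, sigma2):
    """Same log-likelihood, using only N, the sample mean and the (1/N) sample variance."""
    # sum_i (x_i - mu)^2 = N * var + N * (mean - mu)^2
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - n * (var + (mean - mu) ** 2) / (2 * sigma2))

# Made-up data, purely for illustration.
xs = [1.2, 0.7, 3.1, 2.4, 1.9]
n = len(xs)
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n

full = normal_loglik_full(xs, mu=1.5, sigma2=2.0)
suff = normal_loglik_suff(n, mean, var, mu=1.5, sigma2=2.0)
print(full, suff)  # the two values agree up to floating-point error
```

Any other sample of the same size with the same mean and variance would give the same likelihood at every $(\mu, \sigma^2)$, which is exactly what sufficiency means here.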
Let's note three more things. First, that there is no such thing as the sufficient statistic for a distribution, rather, there are many possible statistics which may be sufficient, and which may be of different dimension. Indeed, our second thing to discuss is that the entire sample itself, since it naturally contains all information contained in itself, is always a sufficient statistic. This is a trivial case, but an important one, as in general one cannot always expect to find a sufficient statistic of dimension less than $N$. And the final thing to note is model specificity: that's why I wrote with respect to a given model above. Changing your likelihood will change the sufficient statistics, at least potentially, for a given dataset.
Regarding B: What you're saying is correct, but in addition to allowing analytic posteriors in the univariate case, conjugacy has serious benefits in the context of Bayesian hierarchical models estimated via MCMC. This is because the conditional posteriors are then also available in closed form, so those parameters can be sampled directly rather than through an accept/reject step. That is how conjugacy accelerates Metropolis-within-Gibbs style MCMC algos.
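Here is a minimal sketch of what closed-form conditionals buy you, assuming the normal model from above with a $N(m_0, s_0^2)$ prior on $\mu$ and an inverse-gamma$(a_0, b_0)$ prior on $\sigma^2$. Those priors, the default hyperparameters, and the function name are assumptions chosen for illustration. Because both conditionals are conjugate, every step is a direct draw and no Metropolis step is needed.

```python
import random

def gibbs_normal(xs, m0=0.0, s0_sq=100.0, a0=2.0, b0=2.0, n_iter=2000, seed=0):
    """Gibbs sampler for (mu, sigma2) under assumed semi-conjugate priors."""
    rng = random.Random(seed)
    n = len(xs)
    sum_x = sum(xs)
    mu, sigma2 = sum_x / n, 1.0  # arbitrary starting values
    draws = []
    for _ in range(n_iter):
        # mu | sigma2, x is Normal (normal prior on the mean is conditionally conjugate)
        prec = 1.0 / s0_sq + n / sigma2
        cond_mean = (m0 / s0_sq + sum_x / sigma2) / prec
        mu = rng.gauss(cond_mean, (1.0 / prec) ** 0.5)
        # sigma2 | mu, x is Inverse-Gamma(shape, rate)
        shape = a0 + 0.5 * n
        rate = b0 + 0.5 * sum((x - mu) ** 2 for x in xs)
        # gammavariate takes (shape, scale); 1/Gamma(shape, scale=1/rate) is Inverse-Gamma
        sigma2 = 1.0 / rng.gammavariate(shape, 1.0 / rate)
        draws.append((mu, sigma2))
    return draws

# Made-up data, purely for illustration; no burn-in is discarded in this sketch.
data = [2.9, 3.4, 2.1, 3.8, 3.0, 2.6, 3.3, 2.7]
draws = gibbs_normal(data)
post_mean_mu = sum(mu for mu, _ in draws) / len(draws)
print(post_mean_mu)
```

In a larger hierarchical model only some blocks will be conjugate; those get direct draws like the two above, and the rest fall back to a Metropolis step.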
Regarding C: It's definitely a similar idea, but I do want to make clear that we're talking about two different distributions here: "posterior" versus "posterior predictive". As the name implies, both of these are posterior distributions, which means that they are distributions of an unknown variable conditioned on our known data. A "posterior" plain and simple usually refers to something like $P(\mu, \sigma^2| \{x_1, \ldots, x_N\})$ from our normal example above: a distribution of the unknown parameters that appear in the data generating distribution. In contrast, a "posterior predictive" gives the distribution of a hypothetical $(N+1)$-th data point $x_{N+1}$ conditional on the observed data: $P(x_{N+1}| \{x_1, \ldots, x_N\})$. Notice that this is not conditional on the parameters $\mu$ and $\sigma^2$: they had to be integrated out. It is this additional integral that conjugacy guarantees is available in closed form.
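Written out, the extra integral is the following (using that $x_{N+1}$ is independent of the earlier data once the parameters are given):

$$ P(x_{N+1}| \{x_1, \ldots, x_N\}) = \iint P(x_{N+1}|\mu,\sigma^2)\, P(\mu,\sigma^2| \{x_1, \ldots, x_N\})\, d\mu\, d\sigma^2 $$

As a worked case, take a simplified version of the normal example: assume $\sigma^2$ is known and put a $N(m_0, \tau_0^2)$ prior on $\mu$. The symbols $m_0$, $\tau_0^2$, $m_N$, $\tau_N^2$ are introduced here only for illustration, and $\bar{x}$ is the sample mean. Then the posterior is

$$ \mu \,|\, \{x_1, \ldots, x_N\} \sim N(m_N, \tau_N^2), \qquad \tau_N^2 = \Big(\frac{1}{\tau_0^2} + \frac{N}{\sigma^2}\Big)^{-1}, \qquad m_N = \tau_N^2\Big(\frac{m_0}{\tau_0^2} + \frac{N\bar{x}}{\sigma^2}\Big) $$

and integrating $\mu$ out gives the posterior predictive

$$ x_{N+1} \,|\, \{x_1, \ldots, x_N\} \sim N(m_N,\; \sigma^2 + \tau_N^2) $$

The predictive variance is the noise variance plus the remaining uncertainty about $\mu$, so it is always wider than the plug-in $N(m_N, \sigma^2)$.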
Regarding D: In the context of Variational Bayes (VB), you have some posterior distribution $P(\theta|X)$ where $\theta$ is some vector of $P$ parameters and $X$ are some data. Rather than trying to generate a sample from it like MCMC, we are instead going to use an approximate posterior distribution that's easy to work with and pretty close to the true one. That's called a variational distribution and is denoted $Q_\eta(\theta)$. Notice that our variational distribution is indexed by variational parameters $\eta$. Variational parameters are nothing like the parameters we do Bayesian inference on, and are nothing like our data. They don't have a distribution associated with them and they don't have some hypothetical role generating the data. Rather, they are chosen as a result of an iterative optimization algorithm. The whole idea of variational inference is to define some measure of dissimilarity between the variational distribution and the true posterior and then minimize that measure with respect to the parameters $\eta$. We'll denote the result of that optimization process by $\hat{\eta}(X)$. At that point, hopefully $Q_{\hat{\eta}(X)}(\theta)$ is pretty close to $P(\theta|X)$, and if we do inferences using $Q_{\hat{\eta}(X)}(\theta)$ instead we'll get similar answers.
Now where does conjugacy fit in? A popular way to measure dissimilarity is this measure, which is called the reverse KL cost:
$$ \hat{\eta}(X) := \underset{\eta}{\textrm{argmin}}\, \mathbb{E}_{\theta\sim Q_\eta}\bigg[\log \frac{Q_{\eta}(\theta)}{P(\theta|X)}\bigg] $$
This integral cannot be solved in terms of simple functions in general. However, it is available in closed form when:
- We use a conjugate prior to define $P(\theta|X)$.
- We assume that the variational distribution factorizes over the components of $\theta$ (they are independent under $Q_\eta$), so in other words that $Q_\eta(\theta)=\prod_{j=1}^P Q_{j,\eta_j}(\theta_j)$.
- We further restrict ourselves to a particular family $Q_{j,\eta_j}$ for each $j$ (which is determined by the likelihood).
So it's not that the variational posterior is available in closed form. Rather, it's that the cost function which defines the variational posterior is available in closed form. The cost function being closed form makes computing the variational distribution an easier optimization problem, since we can analytically compute function values and gradients.
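To show what "analytically compute function values and gradients" looks like, here is a deliberately tiny sketch. It assumes a one-dimensional target posterior $P(\theta|X) = N(m, s^2)$ (as you would get from a conjugate normal model with known variance) and a Gaussian variational family $Q_\eta = N(a, b^2)$ with $\eta = (a, \log b)$. All names, the step size, and the starting point are my choices for illustration. In this toy case the true posterior lies inside the variational family, so the optimum recovers it exactly; in real problems it would only be an approximation.

```python
import math

def reverse_kl(a, log_b, m, s):
    """Closed-form KL(Q || P) for Q = N(a, b^2) and P = N(m, s^2)."""
    b = math.exp(log_b)
    return math.log(s / b) + (b * b + (a - m) ** 2) / (2 * s * s) - 0.5

def reverse_kl_grad(a, log_b, m, s):
    """Analytic gradient of reverse_kl with respect to (a, log_b)."""
    b = math.exp(log_b)
    return (a - m) / (s * s), b * b / (s * s) - 1.0

def fit(m, s, lr=0.1, n_steps=500):
    """Plain gradient descent on the closed-form cost; lr is tuned for this toy only."""
    a, log_b = 0.0, 0.0  # arbitrary starting variational parameters
    for _ in range(n_steps):
        grad_a, grad_log_b = reverse_kl_grad(a, log_b, m, s)
        a -= lr * grad_a
        log_b -= lr * grad_log_b
    return a, math.exp(log_b)

a_hat, b_hat = fit(m=2.0, s=0.5)
print(a_hat, b_hat, reverse_kl(a_hat, math.log(b_hat), 2.0, 0.5))
```

No sampling from $Q_\eta$ is needed at any point: the expectation in the cost has already been done by hand, which is the practical payoff of the closed form.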
Best Answer
Calculating posteriors with general/arbitrary priors directly may be a difficult task.
On the other hand, calculating posteriors with mixtures of conjugate priors is relatively simple, since a mixture of priors becomes a mixture of the corresponding posteriors (with the mixture weights updated by the data).
[There are also many cases where some given prior may be quite well approximated by a finite mixture of conjugate priors -- this makes for a very easy to apply and practical approach in many situations, that leads to approximate posteriors that may be made quite close to the exact one.]
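As a small sketch of this, assume a binomial likelihood and a prior that is a finite mixture of Beta distributions (the model, the function names, and the numbers are assumptions chosen for illustration). Each Beta component is updated by the usual conjugate rule, and each weight is rescaled in proportion to that component's marginal likelihood of the data.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def update_beta_mixture(components, successes, failures):
    """components: list of (weight, a, b). Returns the posterior mixture in the same form."""
    updated = []
    for w, a, b in components:
        a_post, b_post = a + successes, b + failures
        # Marginal likelihood of the data under this component, up to the
        # binomial coefficient, which is shared by all components and cancels.
        log_ml = log_beta(a_post, b_post) - log_beta(a, b)
        updated.append((math.log(w) + log_ml, a_post, b_post))
    # Normalize the weights in a numerically stable way.
    max_log = max(lw for lw, _, _ in updated)
    unnorm = [(math.exp(lw - max_log), a, b) for lw, a, b in updated]
    total = sum(u for u, _, _ in unnorm)
    return [(u / total, a, b) for u, a, b in unnorm]

# Bimodal prior: one component favors low success rates, the other high ones.
prior = [(0.5, 2.0, 8.0), (0.5, 8.0, 2.0)]
posterior = update_beta_mixture(prior, successes=7, failures=3)
print(posterior)
```

With 7 successes out of 10, most of the posterior weight moves to the component that favored high success rates, while both components remain Beta distributions.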