It is a little easier notationally to work directly with the standard Normal distribution than with the lognormal distribution and error functions, and the conversion from the random number generated from a truncated standard Normal distribution to one generated from the desired truncated Lognormal distribution should be clear.
Let us label the lower and upper bounds (converted to the standard Normal scale appropriately) as $(l, u)$ respectively. Then define $p_l = \Phi(l)$ and $p_u = \Phi(u)$, the values of the cumulative distribution function (CDF) at $l$ and $u$. Generate random numbers $z$ as follows:
- $x \leftarrow U(0,1)$
- $x' \leftarrow p_l + (p_u - p_l)x$
- $z \leftarrow \Phi^{-1}(x')$
The second assignment creates $x' \sim U(p_l, p_u)$, and running this through the inverse CDF of a standard Normal variate produces a random number that has a standard Normal distribution truncated to $(l, u)$.
Of course, if you have access to a function that will calculate the inverse CDF of a Lognormal distribution with specified parameters, you can work directly with that in step 3, saving yourself any effort of converting from the Lognormal to the standard Normal and back again.
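The three steps above can be sketched in Python using only the standard library. This is a minimal illustration, not a definitive implementation: the function names and the `mu`/`sigma` parameterization (mean and standard deviation of the log of the variable) are assumptions made for the example, and `statistics.NormalDist` stands in for $\Phi$ and $\Phi^{-1}$.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Normal: mean 0, standard deviation 1


def sample_truncated_std_normal(l, u, rng=random):
    """Draw one value from a standard Normal truncated to (l, u)."""
    p_l = _STD_NORMAL.cdf(l)
    p_u = _STD_NORMAL.cdf(u)
    x = rng.random()                  # x ~ U(0, 1)
    x_prime = p_l + (p_u - p_l) * x   # x' ~ U(p_l, p_u)
    # inv_cdf raises StatisticsError if x_prime is exactly 0 or 1,
    # which can happen when the bounds sit very far out in a tail.
    return _STD_NORMAL.inv_cdf(x_prime)


def sample_truncated_lognormal(mu, sigma, lower, upper, rng=random):
    """Draw one value from Lognormal(mu, sigma) truncated to (lower, upper).

    Assumes 0 < lower < upper; mu and sigma describe log(Y), not Y.
    """
    # Convert the bounds to the standard Normal scale.
    l = (math.log(lower) - mu) / sigma
    u = (math.log(upper) - mu) / sigma
    z = sample_truncated_std_normal(l, u, rng)
    # Convert the truncated standard Normal draw back to the Lognormal scale.
    return math.exp(mu + sigma * z)
```

For example, `sample_truncated_lognormal(0.0, 1.0, 0.5, 3.0)` returns a draw between 0.5 and 3.0. When both bounds lie far in the same tail, $p_u - p_l$ loses floating-point precision, so this direct approach is best suited to moderate truncation points.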
Excellent questions.
Regarding A:
A sufficient statistic is nothing more than a distillation of the information that is contained in the sample with respect to a given model. As you would expect, if you have an independent sample $x_i \sim N(\mu,\sigma^2)$ for $i \in \{1, \ldots, N\}$, then once we have calculated the sample mean and sample variance, the individual values of the $x_i$ no longer matter. In linear regression (easier to talk about than logistic in this context), the sampling distribution of the unknown coefficient vector (for known variance) is $N((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}, \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1})$, so as long as these final quantities are identical, inference based on them will be too. This is the idea of sufficiency.
Note that in the $N(\mu,\sigma^2)$ example, the sufficient statistic consists of just two numbers: $\hat{\mu}=\frac{1}{N}\sum_{i=1}^N x_i$ and $\frac{1}{N}\sum_{i=1}^N (x_i-\hat{\mu})^2$, no matter how big our sample size $N$ is (and assuming $N>2$). Likewise, the vector $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ is of dimension $P$ and $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ of dimension $P\times P$ (here $P$ is the number of columns of the design matrix), both of which are independent of $N$ (though, technically, the matrix $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ is just a constant under our assumptions). So in these examples, the sufficient statistic has a fixed number of values (not fixed values), or as I would put it, fixed dimension.
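To make this concrete, here is a small sketch (the helper names and the two toy samples are assumptions chosen for illustration). Two different samples that share the same $N$, sample mean, and $1/N$ sample variance give exactly the same Normal log-likelihood for every $(\mu, \sigma^2)$, because the likelihood only sees the data through those summaries.

```python
import math


def normal_loglik(xs, mu, sigma2):
    """Log-likelihood of i.i.d. N(mu, sigma2) data, computed from the raw sample."""
    n = len(xs)
    sq = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sq / (2 * sigma2)


def normal_loglik_from_stats(n, mean, var, mu, sigma2):
    """Same log-likelihood, using only N, the sample mean, and the 1/N variance."""
    # Uses the identity sum_i (x_i - mu)^2 = N * (var + (mean - mu)^2).
    sq = n * (var + (mean - mu) ** 2)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sq / (2 * sigma2)


def summarize(xs):
    """Return (N, sample mean, 1/N sample variance)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return n, mean, var


# Two different samples with identical N, mean (1.5), and variance (2.25).
c = math.sqrt(4.5)
sample_a = [0.0, 0.0, 3.0, 3.0]
sample_b = [1.5 - c, 1.5, 1.5, 1.5 + c]
```

Evaluating `normal_loglik(sample_a, mu, sigma2)` and `normal_loglik(sample_b, mu, sigma2)` at any common `(mu, sigma2)` gives the same value up to floating-point rounding, and both match `normal_loglik_from_stats(*summarize(sample_a), mu, sigma2)`.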
Let's note three more things. First, there is no such thing as the sufficient statistic for a distribution; rather, there are many possible statistics which may be sufficient, and which may be of different dimension. Second, the entire sample itself, since it naturally contains all the information contained in itself, is always a sufficient statistic. This is a trivial case, but an important one, as in general one cannot always expect to find a sufficient statistic of dimension less than $N$. The final thing to note is model specificity: that's why I wrote with respect to a given model above. Changing your likelihood will change the sufficient statistics, at least potentially, for a given dataset.
Regarding B: What you're saying is correct, but in addition to allowing analytic posteriors in the univariate case, conjugacy has serious benefits in the context of Bayesian hierarchical models estimated via MCMC. This is because the conditional posteriors are also available in closed form, so they can be sampled directly. Conjugacy can therefore speed up Metropolis-within-Gibbs style MCMC algorithms.
Regarding C: It's definitely a similar idea, but I do want to make clear that we're talking about two different distributions here: "posterior" versus "posterior predictive". As the name implies, both of these are posterior distributions, which means that they are distributions of an unknown variable conditioned on our known data. A "posterior" plain and simple usually refers to something like $P(\mu, \sigma^2| \{x_1, \ldots, x_N\})$ from our normal example above: a distribution of the unknown parameters of the data-generating distribution. In contrast, a "posterior predictive" gives the distribution of a hypothetical $(N+1)$-th data point $x_{N+1}$ conditional on the observed data: $P(x_{N+1}| \{x_1, \ldots, x_N\})$. Notice that this is not conditional on the parameters $\mu$ and $\sigma^2$: they had to be integrated out. It is this additional integral that conjugacy guarantees is available in closed form.
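As a sketch of the distinction, consider a simplified version of the normal example in which $\sigma^2$ is known and $\mu$ has a conjugate $N(m_0, s_0^2)$ prior. This simplification, and the function and argument names below, are assumptions made for illustration. Both the posterior of $\mu$ and the posterior predictive of $x_{N+1}$ are then Normal with closed-form parameters.

```python
from statistics import NormalDist


def normal_posterior(xs, sigma2, prior_mean, prior_var):
    """Posterior mean and variance of mu, for known sigma2 and a Normal prior."""
    n = len(xs)
    xbar = sum(xs) / n
    # Precisions (inverse variances) add under conjugacy.
    post_var = 1.0 / (1.0 / prior_var + n / sigma2)
    post_mean = post_var * (prior_mean / prior_var + n * xbar / sigma2)
    return post_mean, post_var


def posterior_predictive(xs, sigma2, prior_mean, prior_var):
    """Distribution of a new point x_{N+1} after integrating mu out."""
    post_mean, post_var = normal_posterior(xs, sigma2, prior_mean, prior_var)
    # Uncertainty about mu (post_var) adds to the sampling noise (sigma2).
    return NormalDist(post_mean, (post_var + sigma2) ** 0.5)
```

The predictive variance is always larger than the posterior variance of $\mu$, since it also includes the noise of the new observation itself.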
Regarding D: In the context of Variational Bayes (VB), you have some posterior distribution $P(\theta|X)$ where $\theta$ is some vector of $P$ parameters and $X$ are some data. Rather than trying to generate a sample from it as MCMC does, we are instead going to use an approximate posterior distribution that's easy to work with and pretty close to the true one. That's called a variational distribution and is denoted $Q_\eta(\theta)$. Notice that our variational distribution is indexed by variational parameters $\eta$. Variational parameters are nothing like the parameters we do Bayesian inference on, and are nothing like our data. They don't have a distribution associated with them and they don't have some hypothetical role in generating the data. Rather, they are chosen by an iterative optimization algorithm. The whole idea of variational inference is to define some measure of dissimilarity between the variational distribution and the true posterior and then minimize that measure with respect to the parameters $\eta$. We'll denote the result of that optimization process by $\hat{\eta}(X)$. At that point, hopefully $Q_{\hat{\eta}(X)}(\theta)$ is pretty close to $P(\theta|X)$, and if we do inference using $Q_{\hat{\eta}(X)}(\theta)$ instead we'll get similar answers.
Now where does conjugacy fit in? A popular choice of dissimilarity is the reverse KL cost, whose minimizer defines $\hat{\eta}(X)$:
$$ \hat{\eta}(X) := \underset{\eta}{\textrm{argmin}}\, \mathbb{E}_{\theta\sim Q_\eta}\bigg[\log\frac{Q_{\eta}(\theta)}{P(\theta|X)}\bigg] $$
This integral cannot be solved in terms of simple functions in general. However, it is available in closed form when all of the following hold:
- We use a conjugate prior to define $P(\theta|X)$.
- We assume that the variational distribution factorizes across parameters, in other words that $Q_\eta(\theta)=\prod_{j=1}^P Q_{j,\eta_j}(\theta_j)$.
- We further restrict ourselves to a particular family $Q_{j,\eta_j}$ for each $j$ (which is determined by the likelihood).
So it's not that the variational posterior is available in closed form. Rather, it's that the cost function which defines the variational posterior is available in closed form. The cost function being closed form makes computing the variational distribution an easier optimization problem, since we can analytically compute function values and gradients.
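A toy sketch of what a closed-form cost buys you: suppose, purely for illustration, that the target is a known Normal $N(m, s^2)$ and the variational family is $Q_\eta = N(a, b^2)$ with $\eta = (a, \log b)$. In real problems the posterior is not known like this, and one typically minimizes an equivalent objective that avoids its normalizing constant. The function names, starting point, and step size below are assumptions. The reverse KL and its gradients are then simple formulas, so plain gradient descent is enough.

```python
import math


def reverse_kl_normal(a, b, m, s):
    """KL(Q || P) for Q = N(a, b^2) and P = N(m, s^2), in closed form."""
    return math.log(s / b) + (b * b + (a - m) ** 2) / (2 * s * s) - 0.5


def fit_variational_normal(m, s, lr=0.1, steps=2000):
    """Minimize the reverse KL over eta = (a, log b) by gradient descent.

    The step size lr is not tuned; it needs to be small relative to s^2.
    """
    a, rho = 0.0, 0.0  # start from Q = N(0, 1); rho = log b keeps b positive
    for _ in range(steps):
        b2 = math.exp(2.0 * rho)
        grad_a = (a - m) / (s * s)      # analytic d KL / d a
        grad_rho = b2 / (s * s) - 1.0   # analytic d KL / d rho
        a -= lr * grad_a
        rho -= lr * grad_rho
    return a, math.exp(rho)
```

Because the variational family here contains the target, the optimizer recovers $a \approx m$ and $b \approx s$ and the cost goes to zero. In general the family is too restrictive for that, and the minimizer is only the closest member in the reverse KL sense.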
As you seem to have (almost) guessed, the truncated distribution comes about from imposing the restriction on the support and then multiplying by a scaling constant to make the restricted density integrate/sum to one. That is all we are doing when we create a truncated version of an initial distribution.
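A minimal sketch of that construction for a continuous distribution, using a standard Normal as the initial distribution (the choice of base distribution and the helper names are assumptions for the example). The scaling constant is one over the probability mass that the original distribution places on $(l, u)$.

```python
from statistics import NormalDist


def truncated_pdf(x, l, u, base=NormalDist()):
    """Density of `base` restricted to (l, u) and rescaled to integrate to one."""
    if not (l < x < u):
        return 0.0  # outside the restricted support
    mass = base.cdf(u) - base.cdf(l)  # probability the base puts on (l, u)
    return base.pdf(x) / mass


def midpoint_integral(f, lo, hi, n=10_000):
    """Crude midpoint-rule integral, used only to check the normalization."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))
```

Integrating `truncated_pdf` over $(l, u)$ with `midpoint_integral` returns a value very close to one, which confirms that the rescaling does its job.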
As to when this is useful, it is useful anytime we want to condition on a restricted range for the observable random variable. This occurs in conditional probability problems when we specify an initial distribution and then condition on the value being in some restricted part of the allowable range. It also occurs in cases where we use an approximating distribution to approximate another distribution on a smaller support. Finally, it also occurs in problems with censored data, when we condition on the non-censored part of the data range.