The likelihood in a parametric model is the joint density (or mass function) of the observed data conditional on any parameters in the model (which are treated as random in a Bayes context).
So using $f$ to denote a density or mass function, and letting $x$ denote the joint $x_i$ data, $y$ denote the joint $y_i$ data, and $\theta$ denote parameters, I would write the likelihood as
$$f(y,x|\theta)=f(y|x,\theta)f(x|\theta).$$
Note that if the density of $x$ doesn't depend on $\theta$, maximizing likelihood over the parameter space is equivalent to maximizing $f(y|x,\theta),$ since the density of $x$ just scales this up.
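To make this concrete, here is a minimal sketch (a toy linear-Gaussian model of my own choosing, not from the question) showing that maximizing the conditional term $f(y|x,\theta)$ picks out the same $\theta$ as maximizing the joint likelihood when the density of $x$ does not involve $\theta$:

```python
# Toy model (assumptions mine): x_i ~ N(0, 1) with no dependence on theta,
# and y_i | x_i ~ N(theta * x_i, 1). True theta is 1.5.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)
y = 1.5 * x + rng.normal(0.0, 1.0, size=200)

thetas = np.linspace(-3.0, 3.0, 601)
cond_ll = np.array([stats.norm.logpdf(y, loc=t * x, scale=1.0).sum() for t in thetas])
joint_ll = cond_ll + stats.norm.logpdf(x, 0.0, 1.0).sum()  # adds a theta-free constant

# Both criteria are maximized at the same theta, since log f(x) only shifts the curve.
print(thetas[cond_ll.argmax()], thetas[joint_ll.argmax()])
```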
As far as using Bayes' rule, we have
$$f(\theta|y,x)=\frac{f(y,x|\theta)f(\theta)}{f(y,x)}=\frac{f(y|x,\theta)f(\theta|x)f(x)}{f(y,x)},$$
and often in a Bayesian context you will see this written up to proportionality, since the other terms are simply normalizing constants ($\theta$ being the random variable of interest):
$$f(\theta|y,x)\propto f(y,x|\theta)f(\theta)\propto f(y|x,\theta)f(\theta|x).$$
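As a rough numerical illustration (same toy model as above, with a prior I chose for the sketch), "up to proportionality" just means you can evaluate $f(y|x,\theta)f(\theta)$ on a grid of $\theta$ and divide by its integral at the end:

```python
# Grid approximation of the posterior for the toy model above (assumptions mine).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 1.5 * x + rng.normal(size=50)

theta_grid = np.linspace(-3.0, 3.0, 1201)
log_prior = stats.norm.logpdf(theta_grid, 0.0, 10.0)  # f(theta), assumed N(0, 10^2)
log_cond = np.array([stats.norm.logpdf(y, t * x, 1.0).sum() for t in theta_grid])  # log f(y|x,theta)

unnorm = np.exp(log_prior + log_cond - (log_prior + log_cond).max())   # proportional to f(theta|y,x)
posterior = unnorm / (unnorm.sum() * (theta_grid[1] - theta_grid[0]))  # normalized on the grid
print(theta_grid[posterior.argmax()])                                  # posterior mode, near 1.5
```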
If you are confused at any point, just write out everything in terms of joint and marginal densities or masses to convince yourself of the correct identity. Hopefully that helps.
I am quite certain that the formula on the grey background is false. If it were true, it would mean that the log-likelihood of a single sample $x$ (i.e. the log of the density of the mixture of two gaussians) is a polynomial of degree two in $x$, i.e. that the mixture of two gaussians is itself a gaussian. This is obviously false, see
https://www.wolframalpha.com/input?i=plot+ln%28%281%2F2%29+e%5E%28-%28x%2B3%2F2%29%5E2%29+%2B+%281%2F2%29+e%5E%28-%28x-3%2F2%29%5E2%29%29+for+x%3D-4..4
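If you prefer to check this without WolframAlpha, here is a small sketch of my own: for a degree-two polynomial the second finite differences are constant, and for the log of this mixture density they clearly are not.

```python
# Second finite differences of ln of the two-gaussian mixture density from the plot above.
import numpy as np

x = np.linspace(-4.0, 4.0, 9)
h = x[1] - x[0]
log_mix = np.log(0.5 * np.exp(-(x + 1.5) ** 2) + 0.5 * np.exp(-(x - 1.5) ** 2))

second_diff = np.diff(log_mix, n=2) / h ** 2
print(second_diff)   # far from constant, so the log-density is not quadratic in x
```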
Note that the Expectation-Maximization (EM) algorithm – which is where the formula on the grey background seems to come from, as is somewhat visible in the CrossValidated post – avoids this issue (and keeps the formulae simple) by first estimating which gaussian each sample came from, and then working with the log-likelihood of the sample together with the index of the gaussian it was sampled from. Let us try to find out what this means.
To find a formula similar to what you are looking for, first define $z_1, \dots, z_n$ to be the indices of the gaussians from which $x_1, \dots, x_n$ were sampled. Note that the $z_i$ are not observed in the data.
The log-likelihood of a joint sample $(z, x)$, for instance $(z_1, x_1)$, is
\begin{align}
\mathcal L\mathcal L(z, x) := \ln w_z + \ln p(x | \mu_z, \sigma^2_z) = \ln w_z - \ln(\sqrt{2\pi\sigma_z^2}) - \frac{(x-\mu_z)^2}{2\sigma_z^2} . \qquad (1)
\end{align}
On the other hand,
$$ \mathbb P(z_1 = k | x_1) = \frac{w_k p(x_1 | \mu_k, \sigma_k^2)}{\sum_{k'} w_{k'} p(x_1 | \mu_{k'}, \sigma_{k'}^2)} . $$
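Here is a small sketch (my notation, placeholder parameter values, components indexed from 0) of formula $(1)$ and of $\mathbb P(z_1 = k | x_1)$ for a two-gaussian mixture:

```python
import numpy as np
from scipy import stats

# Placeholder parameter values (assumptions of mine), a two-gaussian mixture.
w = np.array([0.5, 0.5])        # weights w_k
mu = np.array([-1.5, 1.5])      # means mu_k
sigma2 = np.array([1.0, 1.0])   # variances sigma_k^2

def complete_loglik(z, x):
    """Formula (1): log-likelihood of the joint sample (z, x)."""
    return np.log(w[z]) + stats.norm.logpdf(x, loc=mu[z], scale=np.sqrt(sigma2[z]))

def posterior_z(x1):
    """P(z_1 = k | x_1): the displayed ratio, for a single observation x_1."""
    dens = w * stats.norm.pdf(x1, loc=mu, scale=np.sqrt(sigma2))
    return dens / dens.sum()

print(complete_loglik(0, -2.0))   # log-likelihood of the joint sample (0, -2.0)
print(posterior_z(-2.0))          # heavily favours the gaussian with mean -1.5
```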
You are given $x_1, \dots, x_n$ generated according to an unknown mixture of gaussians. If, somehow, you knew $\mathbb P(z_1 = k | x_1)$ exactly, then you could generate the $z_i$ to obtain a sample $(z_i, x_i)$ that has the same distribution as the original data. Unfortunately, $\mathbb P(z_1 = k | x_1)$ is unknown because the $\mu_i, \sigma_i^2$ are unknown.
Here comes the trick: the EM algorithm assumes you have pretty good estimators for the $\mu_k, \sigma_k^2$, so you have a pretty good estimate of $\mathbb P(z_1 = k | x_1)$. So you can generate $z_i$ in this manner, then compute the log-likelihood of the joint series $(z_i, x_i)$ which has a very simple expression $(1)$.
In fact, sampling the $z_i$ is not needed: you can compute the expected log-likelihood you would obtain in this manner (conditionally on the $x_i$), which is given by
$$ \sum_{i=1}^n \sum_{k} \mathbb P(z_i = k | x_i) \, \mathcal L\mathcal L(k, x_i) . $$
This is where every formula with the sum outside the log ([2] in the CrossValidated post) comes from, and as you can see, such formulas do not actually give the true log-likelihood.
Note 1: There is a slight difference in [2], where $\ln(\sqrt{2\pi\sigma_z^2})$ is replaced by $\frac{1}{2}\ln(\sigma_z^2)$. This is okay because we do not care about additive constants.
Note 2: To be convinced of just how sketchy our reasoning was, be aware that the EM algorithm, which we derived to try and find the maximum likelihood estimator (MLE) via a couple of heuristics, might fail to converge to the MLE, see these lecture notes.
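Putting the pieces above together, here is a minimal sketch (my own, with placeholder data and starting values) of the expected complete-data log-likelihood displayed above and of the closed-form update that maximizing it yields; this is essentially one EM iteration.

```python
import numpy as np
from scipy import stats

def responsibilities(x, w, mu, sigma2):
    """P(z_i = k | x_i) under the current parameter estimates; shape (n, K)."""
    dens = w * stats.norm.pdf(x[:, None], loc=mu, scale=np.sqrt(sigma2))
    return dens / dens.sum(axis=1, keepdims=True)

def expected_complete_loglik(x, w, mu, sigma2):
    """sum_i sum_k P(z_i = k | x_i) * LL(k, x_i), with LL(k, x_i) as in formula (1)."""
    r = responsibilities(x, w, mu, sigma2)
    ll = (np.log(w)
          - 0.5 * np.log(2 * np.pi * sigma2)
          - (x[:, None] - mu) ** 2 / (2 * sigma2))
    return np.sum(r * ll)

def em_update(x, w, mu, sigma2):
    """One EM iteration: E-step = responsibilities, M-step = closed-form maximizer."""
    r = responsibilities(x, w, mu, sigma2)
    n_k = r.sum(axis=0)
    w_new = n_k / len(x)
    mu_new = (r * x[:, None]).sum(axis=0) / n_k
    sigma2_new = (r * (x[:, None] - mu_new) ** 2).sum(axis=0) / n_k
    return w_new, mu_new, sigma2_new

# Placeholder data and starting values, just to make the sketch runnable.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.5, 1.0, 100), rng.normal(1.5, 1.0, 100)])
w, mu, sigma2 = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(20):
    w, mu, sigma2 = em_update(x, w, mu, sigma2)
print(mu, expected_complete_loglik(x, w, mu, sigma2))
```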
Best Answer
This is standard material from variational inference, which may be worth looking into in more depth, since it appears in more places than just VAEs (e.g., in variational Bayesian NNs). Luckily we can get it all fairly simply with some easy probability theory.
The idea is that the true posterior $p_\theta(z|x_i)$ is too difficult to compute directly (i.e., via Bayes rule), so we instead approximate it with $q_\phi(z|x_i)$. We then optimize $\phi$ instead of working with probabilities directly. Mathematically we have that
$$ p_\theta(z|x) = \frac{p_\theta(x|z) p_\theta(z)}{p_\theta(x)} = \frac{p_\theta(x|z) p_\theta(z)}{\int p_\theta(x|z) p_\theta(z) dz} \approx q_\phi(z|x) $$
It turns out that the log-marginal likelihood can be expressed via:
$$ \log p_\theta(x) = \mathcal{D}_\text{KL}\left[ q_\phi(z|x)\mid\mid p_\theta(z|x) \right] + \mathcal{L}(\theta,\phi\mid x) \tag{1} $$
where $\mathcal{L}$ is the evidence lower bound (ELBO), which may be written
\begin{align}
\mathcal{L}(\theta,\phi\mid x) &= \mathbb{E}_{q_\phi(z|x)}\left[ -\log q_\phi(z|x) + \log p_\theta(x,z) \right] \tag{a} \\
&= \int q_\phi(z|x)\left[ -\log q_\phi(z|x) + \log p_\theta(x|z) + \log p_\theta(z) \right]dz \\
&= \int q_\phi(z|x) \log\left(\frac{p_\theta(z)}{q_\phi(z|x)}\right) dz + \int q_\phi(z|x) \log p_\theta(x|z)\, dz \\
&= -\mathcal{D}_\text{KL}\left[ q_\phi(z|x) \mid\mid p_\theta(z) \right] + \mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x|z) \right] \tag{b}
\end{align}
where (a) is the form in eq (2) in the paper and (b) is the form in eq (3) in the paper. We used the fact that
$$ \mathcal{D}_\text{KL}\left[ p_\xi(y) \mid\mid p_\eta(y) \right] = \int \log\left(\frac{p_\xi(y)}{p_\eta(y)}\right) p_\xi(y) \,dy = \mathbb{E}_{p_\xi(y)}\left[ \log\frac{p_\xi(y)}{p_\eta(y)} \right] $$
Ok, but we still need to derive equation (1): \begin{align} \log p_\theta(x) &= \mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x) \right] \\ &= \mathbb{E}_{q_\phi(z|x)}\left[ \log\left( \frac{p_\theta(x,z)}{p_\theta(z|x)} \frac{q_\phi(z|x)}{q_\phi(z|x)} \right) \right] \\ &= \underbrace{ \mathbb{E}_{q_\phi(z|x)}\left[ \log\left( \frac{q_\phi(z|x)}{p_\theta(z|x)} \right) \right] }_{ \mathcal{D}_\text{KL}[q_\phi(z|x)\mid\mid p_\theta(z|x)]} + \underbrace{ \mathbb{E}_{q_\phi(z|x)}\left[ \log\left( \frac{p_\theta(x,z)}{q_\phi(z|x)} \right) \right]}_{\text{Exactly (a) above}} \\ &= \mathcal{D}_\text{KL}[q_\phi(z|x)\mid\mid p_\theta(z|x)] + \mathcal{L}(\theta,\phi\mid x) \end{align} where the first step (taking the expectation) used the fact that there is no dependence on $z$ in the marginal and the second used the identity $p_\theta(x,z) = p_\theta(z|x) p_\theta(x)$. This gives us equation (1) from the paper.
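If it helps, equation (1) can also be checked numerically on a toy model where everything is available in closed form. The model below (standard normal prior, unit-variance Gaussian likelihood, and a Gaussian $q$) is my own choice for the sketch, not from the paper:

```python
# Numerical check of eq (1): log p(x) = KL[q(z|x) || p(z|x)] + ELBO.
# Toy model: p(z) = N(0,1), p(x|z) = N(z,1), hence p(x) = N(0,2) and p(z|x) = N(x/2, 1/2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = 1.3                                   # a fixed observation
m_q, s_q = 0.2, 0.8                       # an arbitrary Gaussian q(z|x) = N(m_q, s_q^2)

# Exact quantities.
log_px = stats.norm.logpdf(x, 0.0, np.sqrt(2.0))
m_post, s_post = x / 2.0, np.sqrt(0.5)    # true posterior p(z|x)
kl = (np.log(s_post / s_q)
      + (s_q**2 + (m_q - m_post)**2) / (2 * s_post**2) - 0.5)

# Monte Carlo estimate of the ELBO, form (a): E_q[log p(x,z) - log q(z|x)].
z = rng.normal(m_q, s_q, size=200_000)
elbo = np.mean(stats.norm.logpdf(x, z, 1.0)       # log p(x|z)
               + stats.norm.logpdf(z, 0.0, 1.0)   # log p(z)
               - stats.norm.logpdf(z, m_q, s_q))  # log q(z|x)

print(log_px, kl + elbo)   # the two numbers agree up to Monte Carlo error
```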
Next we derive the "lower-bounding" part (equation 2 in the paper).
Firstly, though, I want to say that because the KL-divergence is non-negative, we must immediately have equation (2), $$ \log p_\theta(x) \geq \mathcal{L}(\theta,\phi\mid x), $$ trivially from equation (1).
However, it's more common to see people get to (2) as follows. Recall Jensen's inequality, which states that for a concave function $f$ (like log), we get that $f(\mathbb{E}[y]) \geq \mathbb{E}[f(y)]$, so we see that $$ \log p(x) = \log\int p(x,y) \frac{q(y)}{q(y)} dy = \log \mathbb{E}_{q(y)}\left[ \frac{p(x,y)}{q(y)} \right] \geq \mathbb{E}_{q(y)}\left[ \log\left( \frac{p(x,y)}{q(y)} \right) \right] \tag{JI} $$ Using this, we get \begin{align} \log p_\theta(x) &= \log \int p_\theta(x,z) dz \\ &= \log\int q_\phi(z|x)\frac{p_\theta(x,z)}{q_\phi(z|x) } dz \\ &= \log \mathbb{E}_{q_\phi(z|x)}\left[\frac{p_\theta(x,z)}{q_\phi(z|x) }\right] \\ &\geq \mathbb{E}_{q_\phi(z|x)}\left[ \log \left(\frac{p_\theta(x,z)}{q_\phi(z|x) }\right)\right] \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\text{From (JI)}\\ &= \mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x,z) - \log q_\phi(z|x) \right] \end{align} which is exactly equation (a).
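As a tiny sanity check of the Jensen step (JI), here is a sketch of my own showing $\log \mathbb{E}[Y] \geq \mathbb{E}[\log Y]$ for a positive random variable:

```python
# log of the mean vs mean of the log for a positive (here lognormal) random variable.
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
print(np.log(y.mean()), np.log(y).mean())   # the first is larger, as Jensen predicts
```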
As to your comments:
As noted in the derivation, it can be interpreted as approximating the true posterior with a variational distribution. The reasoning is then that we decompose into two terms: one basically optimizes the likelihood using the approximate distribution and the other forces the approximation to match the true posterior.
I am not sure that is the case. It seems possible to choose $q_\phi$ such that the first term gets much worse while the second improves, but I am not certain.
Well perhaps the derivation cleared this up for you, but $x$ is fixed beforehand on both sides of the equation, whereas the expectation is taken over $z\sim q_\phi(z|x)$ on the RHS. So both $x$ and $z$ are well-defined.
One thing to notice is that since the KL-divergence is non-negative, the ELBO $\mathcal{L}$ is always a lower bound on the log-marginal-likelihood. Maximizing it therefore pushes up a lower bound on the true marginal log-likelihood. Another thing is that when the variational approximate posterior ($q_\phi$) is perfect (meaning it matches the true posterior exactly, so the KL vanishes), we get that $\log p_\theta(x) = \mathcal{L}(\theta,\phi\mid x)$, so that the ELBO is exactly the log-marginal likelihood. In other words, we are simultaneously (1) optimizing $q_\phi$ to match the true posterior and (2) optimizing the marginal likelihood. It happens that as you do (1) (improving $q_\phi$), (2) will also become easier (as the ELBO $\mathcal{L}$ will approach the true marginal likelihood)!
The even more common interpretation comes from equation (b), which states that we should choose a latent prior distribution $p_\theta(z)$, and force our approximate posterior to match it, while also optimizing the conditional likelihood (usually some kind of reconstruction error). Hence we get the VAE as a kind of regularized stochastic autoencoder.
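To connect (b) to the loss you typically see in VAE code, here is a rough per-sample sketch under assumptions of my own (diagonal-Gaussian encoder, standard normal prior, Bernoulli decoder, a single reparameterized sample); it is not the paper's exact implementation:

```python
# Per-sample negative ELBO in form (b): closed-form Gaussian KL regularizer
# plus a one-sample Monte Carlo estimate of the reconstruction term.
import numpy as np

def negative_elbo(x, mu, log_var, decode, rng):
    # KL[ q_phi(z|x) || p(z) ] in closed form for a diagonal Gaussian vs N(0, I).
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    # One-sample estimate of E_q[ log p_theta(x|z) ] via the reparameterization z = mu + sigma * eps.
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    p = decode(z)                                    # Bernoulli means, same shape as x
    log_lik = np.sum(x * np.log(p) + (1 - x) * np.log1p(-p))
    return kl - log_lik                              # minimizing this maximizes the ELBO

# Tiny usage example with a made-up "decoder" (a fixed affine map through a sigmoid).
rng = np.random.default_rng(0)
decode = lambda z: 1.0 / (1.0 + np.exp(-(0.5 * z.sum() + np.zeros(4))))
x = np.array([1.0, 0.0, 1.0, 1.0])
print(negative_elbo(x, mu=np.zeros(2), log_var=np.zeros(2), decode=decode, rng=rng))
```

Here the closed-form KL term is the regularizer that pulls $q_\phi(z|x)$ toward the prior, and the log-likelihood term plays the role of the reconstruction error, which is exactly the reading of (b) given above.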