This is standard material from variational inference, which may be worth studying in more depth since it shows up well beyond VAEs (e.g., in variational Bayesian NNs). Luckily we can derive it all fairly simply with some basic probability theory.
The idea is that the true posterior $p_\theta(z|x_i)$ is too difficult to compute directly (i.e., via Bayes' rule), so we instead approximate it with $q_\phi(z|x_i)$ and optimize the variational parameters $\phi$ to make the approximation as close as possible. Mathematically, we have that
$$ p_\theta(z|x) = \frac{p_\theta(x|z) p_\theta(z)}{p_\theta(x)} = \frac{p_\theta(x|z) p_\theta(z)}{\int p_\theta(x|z) p_\theta(z) dz} \approx q_\phi(z|x) $$
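To make the approximation concrete: in the VAE, $q_\phi(z|x)$ is typically a diagonal Gaussian whose mean and (log-)variance are produced by an "encoder" network with parameters $\phi$. Here is a minimal NumPy sketch of that idea, with a single linear layer standing in for the encoder (the layer and the dimensions are made up purely for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: D-dimensional data, d-dimensional latent.
D, d = 5, 2

# "Encoder" parameters phi: a single linear map producing (mu, log variance).
# In a real VAE this would be a neural network.
W_mu, b_mu = rng.normal(size=(d, D)), np.zeros(d)
W_lv, b_lv = rng.normal(size=(d, D)), np.zeros(d)

def q_params(x):
    """Amortized variational posterior q_phi(z|x) = N(mu(x), diag(sigma(x)^2))."""
    mu = W_mu @ x + b_mu
    log_var = W_lv @ x + b_lv
    return mu, log_var

def sample_q(x, n_samples=1):
    """Draw z ~ q_phi(z|x) via the reparameterization trick."""
    mu, log_var = q_params(x)
    eps = rng.normal(size=(n_samples, d))
    return mu + np.exp(0.5 * log_var) * eps

x = rng.normal(size=D)
z = sample_q(x, n_samples=3)
print(z.shape)  # (3, 2): three samples from q_phi(z|x)
```
The point is just that, rather than computing the intractable normalizer $\int p_\theta(x|z) p_\theta(z)\,dz$, we learn $\phi$ so that this cheap distribution behaves like the true posterior.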
It turns out that the log-marginal likelihood can be expressed via:
$$
\log p_\theta(x) = \mathcal{D}_\text{KL}\left[ q_\phi(z|x)\mid\mid p_\theta(z|x) \right] + \mathcal{L}(\theta,\phi\mid x) \tag{1}
$$
where $\mathcal{L}$ is the evidence lower bound (ELBO), which may be written
\begin{align}
\mathcal{L}(\theta,\phi\mid x)
&= \mathbb{E}_{q_\phi(z|x)}\left[ -\log q_\phi(z|x) + \log p_\theta(x,z) \right] \tag{a} \\
&= \int q_\phi(z|x)\left[ -\log q_\phi(z|x) + \log p_\theta(x|z) + \log p_\theta(z) \right]dz \\
&= \int q_\phi(z|x) \log\left(\frac{p_\theta(z)}{q_\phi(z|x)}\right) dz
+ \int q_\phi(z|x) \log p_\theta(x|z)\, dz \\
&=
-\mathcal{D}_\text{KL}\left[ q_\phi(z|x) \mid\mid p_\theta(z) \right]
+
\mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x|z) \right]
\tag{b}
\end{align}
where (a) is the form in eq (2) in the paper and (b) is the form in eq (3) in the paper. We used the fact that
$$ \mathcal{D}_\text{KL}\left[ p_\xi(y) \mid\mid p_\eta(y) \right] = \int \log\left(\frac{p_\xi(y)}{p_\eta(y)}\right) p_\xi(y) \,dy = \mathbb{E}_{p_\xi(y)}\left[ \log\frac{p_\xi(y)}{p_\eta(y)} \right] $$
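To see how form (b) is used in practice (this is an illustrative sketch, not the paper's exact setup): when $q_\phi(z|x)$ is a diagonal Gaussian and the prior $p_\theta(z)$ is standard normal, the KL term has a simple closed form, while the expected log-likelihood is estimated by Monte Carlo with samples from $q_\phi$. The linear Gaussian "decoder" below is a stand-in, just to make the two terms explicit:
```python
import numpy as np

rng = np.random.default_rng(1)

def kl_diag_gaussian_vs_std_normal(mu, log_var):
    """KL[ N(mu, diag(exp(log_var))) || N(0, I) ], closed form."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def log_p_x_given_z(x, z, W_dec):
    """A stand-in decoder: p_theta(x|z) = N(x | W_dec z, I)."""
    mean = W_dec @ z
    return -0.5 * np.sum((x - mean)**2) - 0.5 * len(x) * np.log(2 * np.pi)

# Toy setup: data x, variational parameters (mu, log_var), decoder weights.
D, d = 5, 2
x = rng.normal(size=D)
mu, log_var = rng.normal(size=d), 0.1 * rng.normal(size=d)
W_dec = rng.normal(size=(D, d))

# Monte Carlo estimate of E_q[log p(x|z)] using reparameterized samples.
n_samples = 1000
eps = rng.normal(size=(n_samples, d))
zs = mu + np.exp(0.5 * log_var) * eps
recon = np.mean([log_p_x_given_z(x, z, W_dec) for z in zs])

# ELBO in form (b): -KL[q || p(z)] + E_q[log p(x|z)].
elbo = -kl_diag_gaussian_vs_std_normal(mu, log_var) + recon
print("ELBO (form (b)) ~", elbo)
```
I believe the closed-form KL used here is the same expression given in the appendix of the paper.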
Ok, but we still need to derive equation (1):
\begin{align}
\log p_\theta(x)
&= \mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x) \right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[ \log\left( \frac{p_\theta(x,z)}{p_\theta(z|x)} \frac{q_\phi(z|x)}{q_\phi(z|x)} \right) \right] \\
&= \underbrace{ \mathbb{E}_{q_\phi(z|x)}\left[ \log\left( \frac{q_\phi(z|x)}{p_\theta(z|x)} \right) \right] }_{ \mathcal{D}_\text{KL}[q_\phi(z|x)\mid\mid p_\theta(z|x)]}
+ \underbrace{
\mathbb{E}_{q_\phi(z|x)}\left[ \log\left( \frac{p_\theta(x,z)}{q_\phi(z|x)} \right) \right]}_{\text{Exactly (a) above}} \\
&= \mathcal{D}_\text{KL}[q_\phi(z|x)\mid\mid p_\theta(z|x)]
+ \mathcal{L}(\theta,\phi\mid x)
\end{align}
where the first step (taking the expectation) used the fact that there is no dependence on $z$ in the marginal and the second used the identity $p_\theta(x,z) = p_\theta(z|x) p_\theta(x)$. This gives us equation (1) from the paper.
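If you want to convince yourself of identity (1) numerically, it's easy to do with a small discrete latent, where every quantity is exactly computable. The sketch below builds an arbitrary joint $p(x,z)$ over a few latent states, picks an arbitrary $q(z|x)$, and checks that $\log p(x) = \mathcal{D}_\text{KL}[q\mid\mid p(z|x)] + \mathcal{L}$ (the model and numbers are made up):
```python
import numpy as np

rng = np.random.default_rng(2)

K = 4                                    # number of latent states z
p_z = rng.dirichlet(np.ones(K))          # prior p(z)
p_x_given_z = rng.uniform(0.1, 1.0, K)   # likelihood p(x|z) for one fixed x

p_xz = p_x_given_z * p_z                 # joint p(x, z)
p_x = p_xz.sum()                         # marginal p(x)
post = p_xz / p_x                        # true posterior p(z|x)

q = rng.dirichlet(np.ones(K))            # arbitrary approximate posterior q(z|x)

kl_q_post = np.sum(q * np.log(q / post))         # D_KL[q || p(z|x)]
elbo = np.sum(q * (np.log(p_xz) - np.log(q)))    # E_q[log p(x,z) - log q], form (a)

print(np.log(p_x), kl_q_post + elbo)     # the two numbers agree
assert np.isclose(np.log(p_x), kl_q_post + elbo)
assert elbo <= np.log(p_x)               # equation (2), since the KL is non-negative
```
Setting $q$ equal to the true posterior makes the KL term vanish, in which case the ELBO equals $\log p(x)$ exactly.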
Next we derive the "lower-bounding" part (equation 2 in the paper).
Firstly, though, I want to say that because the KL-divergence is non-negative, we must immediately have equation (2),
$$ \log p_\theta(x) \geq \mathcal{L}(\theta,\phi\mid x), $$
trivially from equation (1).
However, it's more common to see people get to (2) as follows.
Recall Jensen's inequality, which states that for a concave function $f$ (like log), we get that $f(\mathbb{E}[y]) \geq \mathbb{E}[f(y)]$, so we see that
$$ \log p(x)
= \log\int p(x,y) \frac{q(y)}{q(y)} dy
= \log \mathbb{E}_{q(y)}\left[ \frac{p(x,y)}{q(y)} \right]
\geq \mathbb{E}_{q(y)}\left[ \log\left( \frac{p(x,y)}{q(y)} \right) \right]
\tag{JI}
$$
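As a quick numerical sanity check of (JI) before we use it: for any positive random variable $Y$, $\log\mathbb{E}[Y]\geq\mathbb{E}[\log Y]$. For instance (arbitrary choice of distribution):
```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # a positive random variable

lhs = np.log(np.mean(y))   # log E[Y]
rhs = np.mean(np.log(y))   # E[log Y]
print(lhs, rhs)            # lhs > rhs, as Jensen's inequality requires
assert lhs >= rhs
```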
Using this, we get
\begin{align}
\log p_\theta(x) &= \log \int p_\theta(x,z) dz \\
&= \log\int q_\phi(z|x)\frac{p_\theta(x,z)}{q_\phi(z|x) } dz \\
&= \log \mathbb{E}_{q_\phi(z|x)}\left[\frac{p_\theta(x,z)}{q_\phi(z|x) }\right] \\
&\geq \mathbb{E}_{q_\phi(z|x)}\left[ \log \left(\frac{p_\theta(x,z)}{q_\phi(z|x) }\right)\right] \qquad\text{from (JI)}\\
&= \mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x,z) - \log q_\phi(z|x) \right]
\end{align}
which is exactly equation (a).
As to your comments:
"How is this the same as the marginal likelihood? I've been looking at this equation for quite some time and I can't reason through it like I can with the standard marginal likelihood."
As noted in the derivation, it can be interpreted as approximating the true posterior with a variational distribution. The reasoning is then that we decompose into two terms: one basically optimizes the likelihood using the approximate distribution and the other forces the approximation to match the true posterior.
"The only thing I have been able to deduce from it (I think) is that the two terms on the RHS of the equation will be negatively correlated because as the divergence gets smaller, the likelihood of them both goes up and vice versa, correct?"
I am not sure that is the case. I think it is possible to choose $q_\phi$ such that the first term gets much worse while the second improves, but I am not certain of that.
"[For the second equation,] Why is this equal to the joint likelihood of $\theta$ and $\phi$? Why is the input to $p_\theta$ different (joint) than that of $q_\phi$, which is conditional?"
Well perhaps the derivation cleared this up for you, but $x$ is fixed beforehand on both sides of the equation, whereas the expectation is taken over $z\sim q_\phi(z|x)$ on the RHS. So both $x$ and $z$ are well-defined.
One thing to notice is that since the KL-divergence is non-negative, the ELBO $\mathcal{L}$ is always a lower bound on the log marginal likelihood, so maximizing it guarantees we also "push up" the true marginal log-likelihood. Another thing is that when the variational approximate posterior $q_\phi$ is perfect (meaning it matches the true posterior exactly, so the KL vanishes), we get $\log p_\theta(x) = \mathcal{L}(\theta,\phi\mid x)$, i.e., the ELBO is exactly the log marginal likelihood. In other words, we are simultaneously (1) optimizing $q_\phi$ to match the true posterior and (2) optimizing the marginal likelihood. It happens that as you do (1) (improving $q_\phi$), (2) also becomes easier (since the ELBO $\mathcal{L}$ approaches the true log marginal likelihood)!
An even more common interpretation comes from form (b): we choose a prior $p_\theta(z)$ over the latent variables and force the approximate posterior to stay close to it, while also maximizing the conditional likelihood $\log p_\theta(x|z)$ (usually some kind of reconstruction term). Hence we get the VAE as a kind of regularized stochastic autoencoder.
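To connect this to code, here is what a bare-bones VAE training step might look like (a sketch in PyTorch; the layer sizes, Bernoulli decoder, and fake batch are all made up for illustration). The loss is the negative of form (b): a reconstruction term plus the closed-form KL between the Gaussian encoder and the standard-normal prior.
```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """A minimal VAE: Gaussian encoder q_phi(z|x), Bernoulli decoder p_theta(x|z)."""
    def __init__(self, x_dim=784, z_dim=20, hidden=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh())
        self.enc_mu = nn.Linear(hidden, z_dim)
        self.enc_logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        logits = self.dec(z)                                      # Bernoulli logits for p(x|z)
        return logits, mu, logvar

def neg_elbo(x, logits, mu, logvar):
    # Reconstruction term: -E_q[log p(x|z)], one-sample Monte Carlo estimate.
    recon = nn.functional.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL[q_phi(z|x) || N(0, I)] in closed form.
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar)
    return recon + kl

# One illustrative optimization step on a fake binarized batch.
model = TinyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784).round()        # placeholder batch of "binarized images"
logits, mu, logvar = model(x)
loss = neg_elbo(x, logits, mu, logvar)
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
```
Minimizing this loss is the same as maximizing the ELBO, so the reconstruction term plays the autoencoder role and the KL term is the regularizer pulling $q_\phi$ toward the prior.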
Best Answer
I think what's happening is: $x_i\in\mathbb{R}^D \;\forall i$ are the data points, $t_i\in \mathbb{R}$ is the target for $x_i$, the basis functions are $\phi_j:\mathbb{R}^D\rightarrow\mathbb{R}$ (so that $\phi : \mathbb{R}^D\rightarrow \mathbb{R}^M$), and the weights are given by $w\in\mathbb{R}^M$. The data generating mechanism is $$ t = y(x,w) + \epsilon = \sum_{j=0}^{M-1} w_j\phi_j(x) + \epsilon = w^T\phi(x) + \epsilon, $$ meaning $t\sim\mathcal{N}(t\mid y(x,w),\beta^{-1})$. See equation 3.3 in the book.
Next the author looks at the log-likelihood of the data under the model (i.e., $p(t|w,\beta)$), and computes the gradient with respect to the weights. We should see that $\nabla\ln p(t|w,\beta)\in\mathbb{R}^M$, since that is how many weights there are. Our dataset is of size $N$; i.e., $X=\{x_1,\ldots,x_N\}$ and $t=\{t_1,\ldots,t_N\}$. We get: $$ \nabla\mathcal{E}:=\nabla \ln p(t|w,\beta) =\sum_{n=1}^N \underbrace{\left[ t_n - w^T\phi(x_n) \right]}_{\mathbb{R}}\underbrace{\phi(x_n)^T}_{\mathbb{R}^M}\in\mathbb{R}^M $$ since summing over the vectors doesn't change their dimensionality. Indeed, $\nabla\mathcal{E}\in\mathbb{R}^M$, meaning the gradient for the $u$th weight is given by the $u$th component of $\nabla\mathcal{E}$.
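A quick NumPy sketch (with made-up Gaussian basis functions, since the book leaves $\phi$ generic) confirms both the formula and the dimensionality; the summed gradient has exactly $M$ components, one per weight:
```python
import numpy as np

rng = np.random.default_rng(4)

N, D, M = 100, 3, 6                       # data points, input dim, basis functions
X = rng.normal(size=(N, D))
w_true = rng.normal(size=M)

# Made-up Gaussian basis functions phi_j: R^D -> R (centers chosen arbitrarily).
centers = rng.normal(size=(M, D))
def phi(x):
    return np.exp(-0.5 * np.sum((x - centers) ** 2, axis=1))   # shape (M,)

Phi = np.array([phi(x) for x in X])                   # design matrix, shape (N, M)
t = Phi @ w_true + rng.normal(scale=0.1, size=N)      # t = w^T phi(x) + noise

w = np.zeros(M)
grad = np.sum((t - Phi @ w)[:, None] * Phi, axis=0)   # sum_n [t_n - w^T phi(x_n)] phi(x_n)
print(grad.shape)                                     # (M,): one entry per weight
```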
Regarding your question: note that $\phi_n$ appears nowhere in the above equation. There is only $\phi(x_n)$, which is $M$-dimensional (it maps $x_n\in\mathbb{R}^D$ to an $M$-dimensional vector), and so it has the same dimensionality as the weight vector.
Edit for comments:
In ML, the distinction between row and column vectors tends to be overlooked, partly because we operate in (what are assumed to be) Euclidean spaces, where it doesn't matter very much. I see papers switch between them or (more often) just treat the gradient as a vector, ignoring the row vs column distinction altogether.
However, it is slightly more common, I feel, to consider it a row vector, because the derivative of a scalar-valued function $f:\mathbb{R}^M\rightarrow\mathbb{R}$ is a linear map $\mathbb{R}^M\rightarrow\mathbb{R}$, which is naturally written as a row vector (the "numerator layout" convention, under which the chain rule is plain matrix multiplication).
There is plenty of debate on this, though; e.g., [1], [2], [3], [4], [5], [6], [7], [8].
(Aside: mathematically, resolving this requires distinguishing between the differential and the gradient (contravariance vs covariance), and this is not something people care about in ML usually since the metric tensor is Euclidean. See the last few refs for more.)
I don't know that the author says this explicitly, but I guess he implicitly considers the gradient a row vector (at least in this case). I'm pretty sure that in other places in the book the gradient ends up being a column vector, so maybe it's a typo?
TL;DR: in ML, the literature is very lenient regarding $\mathbb{R}^{1\times M}$ vs $\mathbb{R}^{M\times 1}$ vs $\mathbb{R}^{M}$, so I wouldn't worry about it.
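For what it's worth, here is a tiny NumPy illustration of why it rarely matters in practice: the same $M$ numbers can be stored with shape $(M,)$, $(1,M)$, or $(M,1)$, and reshaping between them changes nothing about the values.
```python
import numpy as np

g = np.arange(4.0)          # a "gradient" with M = 4 components, shape (4,)
row = g.reshape(1, -1)      # treat it as a row vector,    shape (1, 4)
col = g.reshape(-1, 1)      # treat it as a column vector, shape (4, 1)

print(g.shape, row.shape, col.shape)           # (4,) (1, 4) (4, 1)
print(np.allclose(row.ravel(), col.ravel()))   # True: same numbers either way
```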