I'm not an expert in this area, but this may be a start.
Rather than $\prod_{p\lt n}$, you can use $\prod_{p\le \sqrt n}$.
$\log\log \sqrt n + \gamma \lt \log\log \sqrt n +\log 2 = \log\log n $
That gets you a little closer, since now you are off by $\log 2 - \gamma \approx 0.116$.
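A quick numeric sanity check of this (illustrative only, not part of the argument): by Mertens' theorem the truncated product is about $2e^{-\gamma}/\log n$, so it overshoots the target $1/\log n$ by the factor $e^{\log 2-\gamma}\approx e^{0.116}$.

```python
# Sanity check: prod_{p <= sqrt(n)} (1 - 1/p) ~ 2*exp(-gamma)/log(n) by
# Mertens' theorem, overshooting the target 1/log(n) by exp(log 2 - gamma).
from math import exp, log, prod, sqrt
from sympy import primerange, EulerGamma

n = 10**8
gamma = float(EulerGamma)
truncated = prod(1 - 1/p for p in primerange(2, int(sqrt(n)) + 1))
mertens = 2 * exp(-gamma) / log(n)   # Mertens prediction for the product
target = 1 / log(n)                  # heuristic probability that n is prime
print(truncated, mertens, target)
```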
The heuristic probability that $n$ is prime is not
$$\prod_{p\lt n} (1-Pr(p|n))$$
It is the product of probabilities
$$\prod_{p\lt n} (1-Pr(p|n \text{ given no smaller prime divides } n))$$
For $p$ small, the term you get may be close to $(1-1/p)$, but I don't think that's the case for $p$ large.
For $\sqrt n \lt p$, the term corresponding to $p$ in the product is just $1$.
For $\sqrt[3]n \lt p \le \sqrt n$, if $p$ is the smallest prime dividing $n$, then $n/p$ must be prime, too. Perhaps that means that by strong induction, we should discount these terms by the probability $n/p$ is prime, about $1/\log \frac np$, so that those terms in the product are $(1-1/(p \log \frac np))$.
It looks like you get some sums/integrals if you try to extend this to more terms, and I don't know whether you can expect to get the desired accuracy at the end.
The proof given in Iwaniec-Kowalski is, as it stands, wrong. It can be easily fixed, as I explain below.
In general, one can think of $\nu(n)^2$ as the characteristic function of integers with $P^-(n)\ge y$, where $P^-(n):=\min\{p : p\mid n\}$. So
$$
\sum_{n\le x} \frac{\nu(n)^2 f(n)}{n} \approx \sum_{ n\le x,\ P^-(n)\ge y } \frac{f(n)}{n} \asymp \prod_{y\le p\le x} \left(1+\frac{f(p)}{p}\right)
$$
for any reasonable multiplicative function $f$ that is bounded on primes. However, an important restriction is that $f(p)<\kappa$ on average, where $\kappa$ is the sifting dimension. In this case the dimension is 1, whereas $f(p)=\tau(p)^3=8$. So the sum in question, say $S$, does not a priori satisfy the claimed bound. In fact, in this case $S\gg(\log x/\log y)^8$:
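(Before the lower-bound computation, a quick numeric sanity check of the product heuristic above in the simplest case $f\equiv1$, with small, purely illustrative test values.)

```python
# Numeric check of  sum_{n<=x, P^-(n)>=y} f(n)/n  ~  prod_{y<=p<=x} (1+f(p)/p)
# up to constants, in the simplest case f == 1, with x = 10^5, y = 11.
x, y = 10**5, 11
spf = list(range(x + 1))                 # smallest-prime-factor sieve
for p in range(2, int(x**0.5) + 1):
    if spf[p] == p:
        for m in range(p * p, x + 1, p):
            if spf[m] == m:
                spf[m] = p
lhs = 1.0 + sum(1/n for n in range(2, x + 1) if spf[n] >= y)  # n = 1 included
rhs = 1.0
for p in range(y, x + 1):
    if spf[p] == p:                      # p prime, y <= p <= x
        rhs *= 1 + 1/p
print(lhs, rhs, lhs / rhs)
```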
If $P^+(n):=\max\{p : p\mid n\}$, then we have that
$$
S= \sum_{n\le x}\frac{\nu(n)^2\tau(n)^3}{n}
\asymp \sum_{P^+(n)\le x}\frac{\nu(n)^2\tau(n)^3}{n}
$$
(this step is heuristic and employed for simplicity). If we let $\sigma_m=\sum_{[d_1,d_2]=m}\theta_{d_1}\theta_{d_2}$ and we expand $\nu(n)^2$, we have that
$$
S \approx \sum_{P^+(m)\le x} \sigma_m \sum_{P^+(n)\le x,\ m|n} \frac{\tau(n)^3}{n}
= P(x) \sum_{P^+(m)\le x} \frac{\sigma_m g(m)}{m},
$$
where $g(m)$ is a multiplicative function with $g(p)=8+O(1/p)$ and $P(x)=\prod_{p\le x}(1+\tau(p)^3/p+\tau(p^2)^3/p^2+\cdots)\asymp(\log x)^8$. Hence
$$
S/P(x)\approx \sum_{P^+(d_1d_2)\le x} \frac{\theta_{d_1}\theta_{d_2}g([d_1,d_2])}{[d_1,d_2]}
= \sum_{P^+(d_1d_2)\le x} \frac{\theta_{d_1}\theta_{d_2}g(d_1)g(d_2)}{d_1d_2} \frac{(d_1,d_2)}{g((d_1,d_2))}.
$$
Writing $f(m)=\prod_{p|m}(1-g(p)/p)$ so that $m/g(m)=\sum_{n|m}f(n)n/g(n)$, we deduce that
$$
S/P(x)\approx \sum_{P^+(n)\le x}\frac{f(n)n}{g(n)} \left( \sum_{P^+(d)\le x,\ n|d} \frac{\theta_{d}g(d)}{d}\right)^2.
$$
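The identity used here is exact and multiplicative, so it can be checked in exact arithmetic; below $g=\tau^3$ is used as a stand-in for the $g$ above (which only satisfies $g(p)=8+O(1/p)$).

```python
# Exact check of  m/g(m) = sum_{n|m} f(n) n/g(n)  with
# f(m) = prod_{p|m} (1 - g(p)/p), for squarefree m, taking g = tau^3.
from fractions import Fraction
from sympy import divisors, primefactors, divisor_count

def g(n):
    return Fraction(int(divisor_count(n)) ** 3)   # tau(n)^3

def f(m):
    out = Fraction(1)
    for p in primefactors(m):
        out *= 1 - g(p) / p
    return out

for m in (1, 2, 15, 11 * 13, 2 * 3 * 5 * 7):
    lhs = Fraction(m) / g(m)
    rhs = sum(f(n) * Fraction(n) / g(n) for n in divisors(m))
    assert lhs == rhs
print("identity verified")
```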
When $y/2 \lt n\le y$, we have that
$$
\sum_{P^+(d)\le x,\ n|d} \frac{\theta_{d}g(d)}{d}
= \frac{\theta_n g(n)}{n} = \frac{\mu(n)g(n)}{G} \sum_{k\le y,\ (k,q)=1,\ n|k} \frac{\mu^2(k)}{\phi(k)}
= \frac{\mu(n)g(n)}{G} \cdot \frac{\textbf{1}_{(n,q)=1}}{\phi(n)}
$$
(note that there is an error in the definition of $\theta_b$ in Iwaniec-Kowalski, as one has to restrict to those $b$ which are coprime to $q$. In fact, $\theta_b=(\mu(b)b/G) \sum_{k\le y,\ (k,q)=1,\ b|k}\mu^2(k)/\phi(k)$). So
$$
S \gtrsim \frac{P(x)}{G^2} \sum_{y/2 \lt n\le y,\ (n,q)=1} \frac{\mu^2(n) f(n)g(n)n}{\phi(n)^2}
\gg_q (\log x)^8(\log y)^5,
$$
by the Selberg-Delange method. This is certainly bigger than what we would need.
In order to remedy this deficiency, one would have to choose $\nu(n)$ in another way, as weights from a sieve of dimension 8 or higher. The easiest choice to work with, as GH also points out, is to define $\nu$ via the relation $\nu(n)^2=(1*\lambda)(n)$, where $(\lambda(d):d\le D)$ is an upper-bound $\beta$-sieve of level $y$ and dimension 8, so that
$$
S = \sum_{n\le x} \frac{(1*\lambda)(n)\tau(n)^3}{n}.
$$
The point is that the sequence $(\lambda(d))_{d\le D}$ satisfies the Fundamental Lemma (Lemma 6.3 in Iwaniec-Kowalski) in the following strong sense:
(1) $\lambda(d)$ is supported on square-free integers with $P^+(d)\le y$
(2) whenever $\{a(d)\}_{d\ge1}$ is a sequence such that
$$
\bigg|\sum_{P^+(d)\le z}\frac{\mu(d)a(dm)}{d}\bigg|
\le Ag(m)\prod_{y_0\le p\le z}(1-g(p)/p)
\quad\text{when}\ P^-(m)>z,
$$
where $A\ge0$ and $y_0\ge1$ are some parameters and $g\ge0$ is multiplicative with
$$
\prod_{w\le p\le w'} (1-g(p)/p)^{-1} \le \left(\frac{\log w'}{\log w}\right)^8\left(1+\frac{K}{\log w}\right)\qquad(w'\ge w\ge y_0),
$$
we have
$$
\sum_{d\le D} \frac{\lambda(d)a(d)}{d} = \sum_{P^+(d)\le y} \frac{\mu(d)a(d)}{d}
+ O_K\left( Ae^{-u}\prod_{y_0\le p\le y}\left(1-\frac{g(p)}{p}\right) \right),
$$
where $u=\log D/\log y$.
Remark: if $a=g$, then the condition on $a$ holds trivially with $A=y_0=1$. Hence, the above statement is indeed a generalization of the usual Fundamental Lemma.
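The remark can be checked in exact arithmetic: for squarefree $z$-smooth $d$ coprime to $m$ one has $g(dm)=g(d)g(m)$, so the sum factors. Illustrated here with $g=\tau^3$ (so $g(p)=8$), $z=7$ and $m=11\cdot13$, which has $P^-(m)>z$.

```python
# Exact check: for a = g,  sum_{P^+(d)<=z} mu(d) g(dm)/d
# equals g(m) * prod_{p<=z} (1 - g(p)/p) whenever P^-(m) > z.
from fractions import Fraction
from itertools import combinations
from sympy import divisor_count

primes = [2, 3, 5, 7]              # all primes <= z = 7
m = 11 * 13                        # P^-(m) = 11 > z

def g(n):
    return Fraction(int(divisor_count(n)) ** 3)   # tau(n)^3

# squarefree z-smooth d are exactly the products of subsets of primes <= z
lhs = Fraction(0)
for r in range(len(primes) + 1):
    for subset in combinations(primes, r):
        d = 1
        for p in subset:
            d *= p
        lhs += Fraction((-1) ** r) * g(d * m) / d   # mu(d) = (-1)^r

rhs = g(m)
for p in primes:
    rhs *= 1 - Fraction(8, p)
print(lhs, rhs)
```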
We have that
$$
S = \sum_{d\le D} \frac{\lambda(d)a(d)}{d},
$$
where
$$
a(d) = \sum_{m\le x/d} \frac{\tau(dm)^3}{m}.
$$
If $P^-(m)>z$, then
\begin{align*}
\sum_{P^+(d)\le z} \frac{\mu(d)a(dm)}{d}
&= \sum_{P^+(d)\le z} \frac{\mu(d)}{d}\sum_{k\le x/(dm)}\frac{\tau(dkm)^3}{k} \\
&= \sum_{n\le x/m} \frac{\tau(nm)^3}{n} \sum_{dk=n,\,P^+(d)\le z}\mu(d).
\end{align*}
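The inner sum over $dk=n$ equals $1$ precisely when $P^-(n)>z$ and vanishes otherwise; a brute-force check of this identity:

```python
# Check that  sum_{dk = n, P^+(d) <= z} mu(d) = 1 if P^-(n) > z, else 0.
from sympy import mobius, divisors, primefactors

def inner_sum(n, z):
    # only squarefree z-smooth divisors d contribute (mu(d) = 0 otherwise)
    return sum(mobius(d) for d in divisors(n)
               if d == 1 or max(primefactors(d)) <= z)

def indicator(n, z):
    return 1 if n == 1 or min(primefactors(n)) > z else 0

ok = all(inner_sum(n, z) == indicator(n, z)
         for z in (2, 5, 13) for n in range(1, 301))
print(ok)
```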
By M\"obius inversion, we then conclude that
\begin{align*}
\sum_{P^+(d)\le z} \frac{\mu(d)a(dm)}{d}
&= \sum_{n\le x/m,\,P^-(n)>z} \frac{\tau(nm)^3}{n} \\
&\ll \tau(m)^3 (\log x/\log z)^8 \\
&\asymp (\log x)^8\tau(m)^3\prod_{11\le p\le z}(1-\tau(p)^3/p),
\end{align*}
so that the second axiom holds for $a$ with $A\asymp(\log x)^8$, $y_0=11$ and $g=\tau^3$. We thus conclude that
\begin{align*}
S = \sum_{d\le D} \frac{\lambda(d)a(d)}{d}
&= \sum_{P^+(d)\le y} \frac{\mu(d)a(d)}{d}
+ O\left( e^{-\log x/\log y} \left(\frac{\log x}{\log y}\right)^8\right) \\
&= \sum_{n\le x,\ P^-(n)>y} \frac{\tau(n)^3 }{n}
+ O\left( e^{-\log x/\log y} \left(\frac{\log x}{\log y}\right)^8\right) \\
&\ll \left(\frac{\log x}{\log y}\right)^8.
\end{align*}
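A small numeric check of the final bound, at illustrative test values $x=10^5$, $y=30$ (the implied constant is quite generous at this range):

```python
# Numeric check of  sum_{n<=x, P^-(n)>y} tau(n)^3/n  <<  (log x / log y)^8.
from math import log

x, y = 10**5, 30
tau = [0] * (x + 1)
for d in range(1, x + 1):               # divisor-counting sieve
    for m in range(d, x + 1, d):
        tau[m] += 1
spf = list(range(x + 1))                # smallest-prime-factor sieve
for p in range(2, int(x**0.5) + 1):
    if spf[p] == p:
        for m in range(p * p, x + 1, p):
            if spf[m] == m:
                spf[m] = p
S = 1.0 + sum(tau[n]**3 / n for n in range(2, x + 1) if spf[n] > y)
bound = (log(x) / log(y)) ** 8
print(S, bound)
```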
A more general remark: Selberg's sieve is not as flexible as the $\beta$-sieve as far as ``preliminary sieving'' is concerned because it carries inside it the sieve problem it is applied to, in contrast to the $\beta$-sieve weights that only depend on the sifting dimension via the $\beta$ parameter.
EDIT: the monotonicity principle II described in page 49 of "Opera de Cribro" by Friedlander-Iwaniec is a way to use Selberg's sieve weights for preliminary sieving. See also Proposition 7.2 in the same book.
The complex-analytic proof of the prime number theorem can help inform the elementary one.
The von Mangoldt function $\Lambda$ is related to the Riemann zeta function $\zeta$ by the formula $$ \sum_n \frac{\Lambda(n)}{n^s} = -\frac{\zeta'(s)}{\zeta(s)},$$ at least in the region $\mathrm{Re}(s) > 1$. The right-hand side has a pole of residue 1 at the simple pole $s=1$ of $\zeta$, and a pole of residue -1 at each zero $s=\rho$ of $\zeta$. Applying Perron's formula carefully, one eventually arrives at the von Mangoldt explicit formula
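This identity is easy to test numerically: truncate the Dirichlet series at $N=10^5$ (the tail is $O(1/N)$) and compare with $-\zeta'(2)/\zeta(2)$ computed via mpmath.

```python
# Check of  sum_n Lambda(n)/n^s = -zeta'(s)/zeta(s)  at s = 2.
from math import log
from mpmath import mp, zeta, diff

mp.dps = 30
N = 10**5
lam = [0.0] * (N + 1)                 # von Mangoldt: Lambda(p^k) = log p
sieve = [True] * (N + 1)
for p in range(2, N + 1):
    if sieve[p]:
        for m in range(2 * p, N + 1, p):
            sieve[m] = False
        q = p
        while q <= N:
            lam[q] = log(p)
            q *= p
partial = sum(lam[n] / n**2 for n in range(2, N + 1))
exact = -diff(zeta, 2) / zeta(2)      # numerical derivative of zeta at s = 2
print(partial, float(exact))
```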
$$ \sum_{n \leq x} \Lambda(n) = x - \sum_\rho \frac{x^\rho}{\rho} + \dots$$ where the remaining terms $\dots$ (as well as the way in which the infinite sum over zeroes is interpreted) will be ignored for this discussion. This formula strongly suggests that zeroes $\rho$ with real part strictly less than one will eventually only make a lower order contribution to $\sum_{n \leq x} \Lambda(n)$, whereas zeroes $\rho$ with real part equal to one would likely destroy the prime number theorem. In fact one can pursue this line of reasoning more carefully and show that the prime number theorem is equivalent to the absence of zeroes of zeta on the line $\mathrm{Re}(s)=1$.
But what if there was a way to make the contribution of the zeroes of zeta smaller than those of the pole, even when said zeroes lie on the line $\mathrm{Re}(s)=1$? One way to do this is to replace the log-derivative $-\frac{\zeta'(s)}{\zeta(s)}$ by the variant $\frac{\zeta''(s)}{\zeta(s)}$. Assuming for simplicity that all zeroes are simple, this function still has a simple pole at every zero $\rho$ (with residue $\frac{\zeta''(\rho)}{\zeta'(\rho)}$), but now has a double pole at $s=1$ (with residue $2$). As it turns out, this variant also has a Dirichlet series representation $$ \sum_n \frac{ \Lambda_2(n) }{n^s} = \frac{\zeta''(s)}{\zeta(s)}$$ which by Perron's formula suggests an asymptotic of the form $$ \sum_{n \leq x} \Lambda_2(n) = 2 x \log x + \sum_\rho \frac{\zeta''(\rho)}{\zeta'(\rho)} \frac{x^\rho}{\rho} + \dots$$ Now, even if there are zeroes on $\mathrm{Re}(s)=1$, their contribution should still be smaller than the main term, and one may now be led to predict an asymptotic of the form $$ \sum_{n \leq x} \Lambda_2(n) = 2 x \log x + O(x) \quad (4)$$ and furthermore this asymptotic should be easier to prove than the prime number theorem, as it is not reliant on preventing zeroes at the edge of the critical strip.
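The predicted asymptotic (4) is visible numerically even at modest heights, computing $\Lambda_2(n)=\Lambda(n)\log n+\sum_{d|n}\Lambda(d)\Lambda(n/d)$ directly (a rough check; the test value $x=10^4$ is illustrative only).

```python
# Numeric check of  sum_{n<=x} Lambda_2(n) = 2 x log x + O(x)  at x = 10^4.
from math import log

x = 10**4
lam = [0.0] * (x + 1)                 # von Mangoldt: Lambda(p^k) = log p
sieve = [True] * (x + 1)
for p in range(2, x + 1):
    if sieve[p]:
        for m in range(2 * p, x + 1, p):
            sieve[m] = False
        q = p
        while q <= x:
            lam[q] = log(p)
            q *= p
psi = [0.0] * (x + 1)                 # psi[t] = sum_{n<=t} Lambda(n)
for n in range(1, x + 1):
    psi[n] = psi[n - 1] + lam[n]
S = sum(lam[n] * log(n) for n in range(2, x + 1))          # Lambda(n) log n part
S += sum(lam[d] * psi[x // d] for d in range(2, x + 1))    # Lambda*Lambda part
main = 2 * x * log(x)
print(S, main, (S - main) / x)
```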
The functions $\Lambda_2$ and $\Lambda$ are related via the identity $$ \frac{\zeta''(s)}{\zeta(s)} = -\frac{d}{ds} (-\frac{\zeta'(s)}{\zeta(s)}) + (-\frac{\zeta'(s)}{\zeta(s)})^2$$ which after inverting the Dirichlet series leads to $$ \Lambda_2(n) = \Lambda(n) \log n + \sum_{d|n} \Lambda(d) \Lambda(\frac{n}{d}) \quad (5)$$ and from this and (4) we soon obtain (2) and hence (1).
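Both sides of this Dirichlet-series identity can be compared numerically at $s=2$, truncating at $N=3000$ (the tail is $O(\log N/N)$):

```python
# Check of  sum_n Lambda_2(n)/n^s = zeta''(s)/zeta(s)  at s = 2, with
# Lambda_2(n) = Lambda(n) log n + sum_{d|n} Lambda(d) Lambda(n/d).
from math import log
from mpmath import mp, zeta, diff

mp.dps = 30
N = 3000
lam = [0.0] * (N + 1)                 # von Mangoldt: Lambda(p^k) = log p
sieve = [True] * (N + 1)
for p in range(2, N + 1):
    if sieve[p]:
        for m in range(2 * p, N + 1, p):
            sieve[m] = False
        q = p
        while q <= N:
            lam[q] = log(p)
            q *= p

def lambda2(n):
    return lam[n] * log(n) + sum(lam[d] * lam[n // d]
                                 for d in range(1, n + 1) if n % d == 0)

partial = sum(lambda2(n) / n**2 for n in range(2, N + 1))
exact = diff(zeta, 2, 2) / zeta(2)    # second derivative of zeta at s = 2
print(partial, float(exact))
```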
One can also think about this from a "music of the primes" standpoint. The explicit formula is morally resolving $\Lambda$ into "harmonics" as $$ \Lambda(n) \approx 1 - \sum_\rho n^{\rho-1} + \dots \quad (6)$$ where I am deliberately being vague as to how to interpret the symbols $\approx$, $\sum$, and $\dots$. The Dirichlet convolution of $1$ with itself resembles $\log n$, while the Dirichlet convolution of $-n^{\rho-1}$ with itself resembles $n^{\rho-1} \log n$; also, cross-terms such as the convolution of $1$ with $-n^{\rho-1}$ experience significant cancellation and should be lower order. Thus the right-hand side of (5) amplifies the "1" harmonic of $\Lambda$ by a factor of about $2 \log n$, while damping out the contributions of each of the zeroes $\rho$ due to cancellation between the two terms, leading one to (1) or (2) as a smoothed out version of the prime number theorem. Conversely, (2) combined with (5) can be viewed as an elementary way to encode (the largest terms of) the explicit formula (4), without explicit mention of zeroes.
The most difficult part of the Erdos-Selberg elementary proof of the PNT is the passage from (1) to the prime number theorem, as here we must actually eliminate the scenario in which there is a zero on $\mathrm{Re}(s)=1$. The key to doing so is to exploit the non-negativity of $\Lambda$; morally, the expansion (6) prevents $\rho$ (and hence also $\overline{\rho}$) from lying on $\mathrm{Re}(s)=1$ as this would make the right-hand side negative "on average" for certain sets of $n$. There are several ways to make this intuition precise; the standard complex-analytic proofs often proceed by testing $\Lambda$ against $n^{\pm it}$ and $n^{\pm 2it}$, where $1+it$ is a hypothetical zero of $\zeta$. Very roughly speaking, the idea of the elementary proof is to exploit not only the non-negativity of $\Lambda$, but also upper bounds coming from sieve-theoretic bounds such as the Brun-Titchmarsh inequality, which roughly speaking tell us that $\Lambda$ behaves "on average" as if it lies between $0$ and $2$. (Actually, the Selberg formula (2) already gives an adequate substitute for this inequality for the purposes of this argument.) On the other hand, from (6) we see that if there were a zero $\rho = 1+it$ on the line $\mathrm{Re}(s)=1$, then $\Lambda-1$ should have an extremely large correlation with $\mathrm{Re}\, n^{\rho-1} = \cos(t \log n)$, and these two facts can be placed in contradiction with each other, basically because the average value of $|\cos(t \log n)|$ is strictly less than one. However, because the zeroes $\rho$ are not explicitly present in the Selberg formula (2), one has to work quite a bit in the elementary proof to make this argument rigorous. In this blog post of mine I use some Banach algebra theory to tease out the zeroes $\rho=1+it$ from the Selberg formula (they can be interpreted as the "Gelfand spectrum" of a certain Banach algebra norm associated to that formula) to make this connection more explicit.
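The last point is concrete: for the logarithmic average, $\frac{1}{\log N}\sum_{n\le N}|\cos(t\log n)|/n$ tends to $2/\pi\approx0.637$, strictly below $1$. A quick check at the illustrative value $t=1$:

```python
# Logarithmic average of |cos(t log n)|: should be close to 2/pi < 1.
from math import cos, log, pi

N, t = 10**6, 1.0
avg = sum(abs(cos(t * log(n))) / n for n in range(1, N + 1)) / log(N)
print(avg, 2 / pi)
```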