Lots of questions here. I think I've had a crack at all of them, but let me know if I've missed something or made a mistake somewhere. I may be light on detail in places; apologies in advance.
First: Why is proving the result for functions on $[0,1]$ sufficient?
Given that we have the result for functions on $[0,1]$, we can define the function $$g(x) = f(a + (b-a)x),$$ and apply the result to $g$, which is a function on $[0,1]$. This gives us a sequence of polynomials $\{ P_n \}$ on $[0,1]$. Given that
$$ f(x) = g\left(\frac{x-a}{b-a}\right), $$
and each $P_n\left(\frac{x-a}{b-a}\right)$ is also a polynomial in $x$, the sequence $\left\{ P_n\left(\frac{x-a}{b-a}\right) \right\}$ is the desired sequence of polynomials approximating $f$ uniformly on $[a,b]$.
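The rescaling argument can be sketched numerically. This is a hedged illustration, not Rudin's construction: the choices $f = \cos$, $[a,b] = [2,5]$, and a degree-$10$ least-squares fit (standing in for the sequence $P_n$) are all mine. The point is that the uniform error of $P\left(\frac{x-a}{b-a}\right)$ against $f$ on $[a,b]$ matches the uniform error of $P$ against $g$ on $[0,1]$.

```python
import numpy as np

# Illustrative choices (not from Rudin): f = cos on [a, b] = [2, 5].
a, b = 2.0, 5.0
f = np.cos

# g(x) = f(a + (b - a) x) is a continuous function on [0, 1].
x01 = np.linspace(0.0, 1.0, 1001)
g = f(a + (b - a) * x01)

# A polynomial approximation P of g on [0, 1]; a least-squares fit
# stands in for the sequence P_n from the theorem.
P = np.poly1d(np.polyfit(x01, g, deg=10))

# x -> P((x - a)/(b - a)) is still a polynomial in x, and its uniform
# error against f on [a, b] matches the error of P against g on [0, 1].
xab = np.linspace(a, b, 1001)
err_g = np.max(np.abs(P(x01) - g))
err_f = np.max(np.abs(P((xab - a) / (b - a)) - f(xab)))
print(err_g, err_f)
```

Since the substitution $x \mapsto \frac{x-a}{b-a}$ is an affine bijection of $[a,b]$ onto $[0,1]$, the two uniform errors agree exactly; the sketch only confirms this numerically.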
Second: Why is $f$ uniformly continuous?
$f$ is continuous on the compact set $[0,1]$, and is thus uniformly continuous there. It is also constant outside of $[0,1]$, so nothing on the rest of the real line can "break" uniform continuity. Showing this rigorously is a simple application of the definition of uniform continuity once you've established, by the argument above, that $f$ is uniformly continuous on $[0,1]$.
Alternatively, you can view this as a consequence of the pasting lemma for uniformly continuous functions.
Third: Is the fact that $Q_n \to 0$ uniformly on $\delta \le \lvert x \rvert \le 1$ needed in the proof?
You're right: this specific fact plays no direct role in the proof. That is, once we have the bound $Q_n (x) \le \sqrt n \left( 1- \delta^2 \right)^n $ for $\delta \le \lvert x \rvert \le 1$, we have no further need for the uniform convergence of $Q_n$. This is natural, since that bound is precisely what implies the uniform convergence in the first place.
Of course, one can still prove the theorem (in particular, the second-to-last inequality) by dispensing with the specific bound and using only the fact that $Q_n \to 0$ uniformly, but that seems to require extra notation, if not some extra work, to set up.
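The bound itself is easy to check numerically. A hedged sketch, with my own choices of $n$ and $\delta$: with $Q_n(x) = c_n (1 - x^2)^n$ and $c_n$ chosen so that $\int_{-1}^1 Q_n = 1$, Rudin's estimate $c_n < \sqrt n$ gives $Q_n(x) \le \sqrt n (1-\delta^2)^n$ on $\delta \le \lvert x \rvert \le 1$.

```python
import numpy as np

def c_n(n, num_points=200001):
    """Normalizing constant: 1 over the integral of (1 - x^2)^n on [-1, 1],
    computed by the trapezoid rule."""
    x = np.linspace(-1.0, 1.0, num_points)
    y = (1.0 - x * x) ** n
    dx = x[1] - x[0]
    return 1.0 / ((np.sum(y) - 0.5 * (y[0] + y[-1])) * dx)

delta = 0.3  # illustrative choice of delta in (0, 1)
for n in (5, 20, 80):
    cn = c_n(n)
    assert cn < np.sqrt(n)  # Rudin's estimate c_n < sqrt(n)
    x = np.linspace(delta, 1.0, 1001)
    Qn = cn * (1.0 - x * x) ** n
    bound = np.sqrt(n) * (1.0 - delta * delta) ** n
    assert np.all(Qn <= bound)  # the tail bound that forces Q_n -> 0 there
```

Since $(1-\delta^2)^n$ decays geometrically while $\sqrt n$ grows only polynomially, the bound also makes the uniform convergence on the tails immediate.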
Fourth: How do you demonstrate that $ \int_0^1 f(t) Q_n (t -x) \, \mathrm dt $ is a polynomial in $x$?
Observe that
$$ \int_0^1 f(t) Q_n (t -x) \, \mathrm dt = \int_0^1 f(t) c_n \left(1 - (t-x)^2\right)^n \, \mathrm d t. $$
Standard manipulations (the binomial theorem, say) let you expand $\left(1-(t-x)^2\right)^n$ into a sum of terms of the form $a_{j,k}\, t^j x^k$. Integrating in $t$ turns each $\int_0^1 f(t)\, t^j \,\mathrm dt$ into a constant, so what remains is visibly a polynomial in $x$ (of degree at most $2n$).
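One can carry out this expansion symbolically for a small fixed $n$. A hedged sketch: $n = 2$, $f(t) = t^2$, and the symbolic constant $c$ standing for $c_n$ are all illustrative choices, not from Rudin; the assertion at the end confirms the result is a polynomial in $x$.

```python
import sympy as sp

t, x, c = sp.symbols('t x c')
n = 2        # small fixed n for illustration
f = t**2     # illustrative continuous f, not from Rudin

# Expand the integrand and integrate term by term in t over [0, 1];
# f enters only through the moments integral of f(t) t^j over [0, 1],
# which become constants, leaving a polynomial in x.
integrand = f * c * (1 - (t - x)**2)**n
result = sp.expand(sp.integrate(sp.expand(integrand), (t, 0, 1)))
print(result)
assert result.is_polynomial(x)
assert sp.degree(result, x) == 2 * n
```

The degree comes from the leading term $(t-x)^{2n}$ of the expansion, whose $x^{2n}$ coefficient survives integration whenever $\int_0^1 f(t)\,\mathrm dt \ne 0$ or, as here, some moment of $f$ is nonzero.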
Finally: Should $\delta \in (0,1)$?
I think Rudin snuck this assumption in without making it clear. See, for example, $(50)$. Note also that the condition $\lvert y-x \rvert < \delta$ seems to suggest $\delta \le 1$, since $x,y \in [0,1]$. In any case, you actually do need $\delta \in (0,1)$ for the last inequality to work for large $n$.
Best Answer
It wouldn't change much, since $$(1-x^4)^n \geq 1-nx^4$$ for the same reason (i.e., the derivative of $f(x)=(1-x^4)^n-1+nx^4$ is $-4nx^3(1-x^4)^{n-1}+4nx^3$, which is positive on $(0,1)$), and you can then integrate up to $1/\sqrt[4]{n}$ and get a similar estimate for $c_n$. After that, the only other places where $(1-x^2)^n$ appears are in justifying the uniform convergence of $Q_n$ outside any small interval around $0$, which holds by essentially the same argument, and in the last inequality, where it makes no difference.
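The analogous estimates can be checked numerically. A hedged sketch, with my own choices of $n$: the inequality $(1-x^4)^n \ge 1 - nx^4$ on $[0,1]$, the exact value $\int_0^{n^{-1/4}} (1 - n t^4)\,\mathrm dt = \tfrac{4}{5} n^{-1/4}$, and the resulting bound $c_n \le \tfrac{5}{8} n^{1/4}$ for the analogous normalizing constant.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100001)
for n in (3, 10, 50):
    # The pointwise inequality (1 - x^4)^n >= 1 - n x^4 on [0, 1]
    # (small tolerance for floating-point equality at x = 0).
    assert np.all((1.0 - x**4) ** n >= 1.0 - n * x**4 - 1e-12)

    # Integrating 1 - n t^4 from 0 to n^{-1/4} gives (4/5) n^{-1/4},
    # so the full integral of (1 - t^4)^n over [-1, 1] is at least
    # (8/5) n^{-1/4}, and hence c_n <= (5/8) n^{1/4}.
    upper = n ** (-0.25)
    exact_piece = upper - n * upper**5 / 5.0
    assert abs(exact_piece - 0.8 * upper) < 1e-12

    y = (1.0 - x**4) ** n
    dx = x[1] - x[0]
    integral = 2.0 * (np.sum(y) - 0.5 * (y[0] + y[-1])) * dx  # doubled: even integrand
    cn = 1.0 / integral
    assert cn <= (5.0 / 8.0) * n**0.25 + 1e-9
```

The only change from Rudin's $c_n < \sqrt n$ is the slower growth rate $n^{1/4}$, which is still polynomial and so is still beaten by the geometric decay of $(1-\delta^4)^n$ on the tails.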
However, there is more to it than "it wouldn't change anything". The idea behind Rudin's proof is that the convolution of two functions should inherit the best properties of each. A good example is convolution with a smooth, compactly supported function, which is a standard way to prove density of the smooth functions in $C^0(I)$, for example. So he aims for a convolution with a polynomial.
The point is: if $f$ is a function (continuous, $L^1_{loc}$, or whatever, depending on context), we expect that after convolving with a function $\phi$ normalized so that $\int \phi = 1$, whose support is small and concentrated near $0$, the convolution $\widetilde{f}$ is close to $f$. (Think of a discrete averaging where $\phi$ is a "weight": the convolution at a point is the average of the function according to that weight, centered at the point. If we put more weight at the point itself and less on nearby values, the average shouldn't shake things up too much.) This is easy to arrange for $C^{\infty}$ functions, since we have bump functions. But here we want to prove density of polynomials, which are not so malleable. Indeed, we can't have a polynomial with compact support that integrates to $1$ (we can't have non-trivial polynomials with compact support at all!).
So we emulate compact support by taking a polynomial that is highly concentrated near the origin. The polynomial $p(x)=1-x^2$ is perhaps the simplest example of something concentrated near the origin. An important observation is that since the function is defined on $[0,1]$, what matters for our averaging polynomial is its behaviour on $[-1,1]$, so the fact that $1-x^2$ blows up outside of $[-1,1]$ is irrelevant. We then raise it to the power $n$ since, on $(-1,1)$, this concentrates things even more near the origin (plot a few powers to see this). The estimates Rudin makes guarantee that everything goes through; they represent the technical cost of not having compact support (let alone support as small as desired).
The polynomial $1-x^4$ also has the property that raising it to the power $n$ concentrates things near the origin. The only difference is that it is less concentrated than $1-x^2$, so the error the convolution makes will likely be larger than with $1-x^2$.
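The concentration comparison can be made quantitative. A hedged sketch, with my own choices of $n$ and $\delta$: for the normalized kernels built from $(1-x^2)^n$ and $(1-x^4)^n$, compare the fraction of mass lying on $\delta \le \lvert x \rvert \le 1$. Since $x^2 > x^4$ on $(0,1)$, the $x^2$ kernel decays faster away from $0$ and leaves less mass in the tails.

```python
import numpy as np

def tail_mass(power, n, delta, num_points=200001):
    """Fraction of the kernel (1 - |x|^power)^n's mass lying on
    delta <= |x| <= 1, via a uniform-grid Riemann sum."""
    x = np.linspace(-1.0, 1.0, num_points)
    y = (1.0 - np.abs(x) ** power) ** n
    return np.sum(y[np.abs(x) >= delta]) / np.sum(y)

n, delta = 50, 0.4  # illustrative choices
m2 = tail_mass(2, n, delta)
m4 = tail_mass(4, n, delta)
print(m2, m4)
assert m2 < m4   # the x^2 kernel is more concentrated near 0
assert m4 < 0.5  # but both concentrate as n grows
```

This matches the remark above: $1-x^4$ still works, but its averaging kernel spreads more weight away from the point being averaged, so the convolution error is larger for the same $n$.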