The key assumption is that the signal is band-limited. This is a frequency-domain assumption. Any sensible proof must go through the frequency domain.
The same goes for approximation: any rigorous proof must, at some level, be an approximation argument, including the one you alluded to once it is made precise.
Here is another Hilbert space argument that, in the end, gives us approximation in the topology of uniform convergence (much better than $L^1$ or $L^2$):
Let $H$ be the set of elements of $L^2(\mathbb{R})$ that are band-limited to $[-T,T]$ (no extra assumptions). By unitarity of the Fourier transform $\mathcal{F}$, $H$ is a closed Hilbert subspace of $L^2$. Since $L^2$ functions of compact support lie in $L^1$, the Fourier inversion theorem implies that every element of $H$ agrees almost everywhere with a continuous function. So the band-limited assumption already gives (for now) continuity, and pointwise sampling makes sense.
$\mathcal{F}(H) = L^2([-T,T])$ has the orthonormal basis $\{ \tfrac{1}{\sqrt{2T}} e^{- 2 \pi i k\frac{\xi}{2T}} \}_{k \in \mathbb{Z}}$ (I suppress the normalization constant below to keep the formulas uncluttered). It is now natural to compute the Hilbert space expansion of $\hat{f}$ in this basis and then apply $\mathcal{F}^{-1}$. By unitarity,
$$
\langle \hat{f}, e^{- 2 \pi i k\frac{\xi}{2T}} \rangle = \langle f, \delta_{\frac{k}{2T}}\rangle = f(\frac{k}{2T}).
$$
Strictly speaking, one needs a rigged Hilbert space that includes distributions to make sense of inner products with delta functions, but everything works out. On the other hand, the inverse Fourier transforms of the basis functions $e^{- 2 \pi i k\frac{\xi}{2T}}$ (extended by zero outside $[-T,T]$) are just shifts of the $\mbox{sinc}$ function. So Shannon's sampling formula holds in the $L^2$ sense.
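As a sanity check on the identity $\langle \hat{f}, e^{-2\pi i k \xi/(2T)} \rangle = f(\frac{k}{2T})$, here is a small numerical sketch in Python. The concrete choice of spectrum (a triangle on $[-T,T]$, whose inverse Fourier transform is known in closed form) and all names are my own illustration, not part of the argument above:

```python
import numpy as np

# Numerical sketch: <f_hat, e^{-2πik ξ/(2T)}> = f(k/(2T)) for a concrete
# band-limited f. Here f_hat is the triangle on [-T, T], whose inverse
# Fourier transform is f(t) = T * sinc(T t)^2. Note numpy's sinc already
# includes the factor π: np.sinc(x) = sin(πx)/(πx).

T = 1.0
xi = np.linspace(-T, T, 20001)
f_hat = 1.0 - np.abs(xi) / T          # triangle spectrum, supported on [-T, T]
dxi = xi[1] - xi[0]

def sample(k):
    """f(k / (2T)) computed from the closed-form inverse transform."""
    t = k / (2 * T)
    return T * np.sinc(T * t) ** 2

for k in range(-3, 4):
    # the conjugation in the inner product flips the sign in the exponential
    inner = np.sum(f_hat * np.exp(2j * np.pi * k * xi / (2 * T))) * dxi
    assert abs(inner - sample(k)) < 1e-6
```

The assertions pass because the inner product against the $k$-th basis exponential is exactly the Fourier inversion integral evaluated at $t = \frac{k}{2T}$.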
To strengthen the convergence, notice that $L^2$-convergence in the frequency domain implies $L^1$-convergence there: by Cauchy-Schwarz on the compact band, $\|g\|_{L^1[-T,T]} \leq \sqrt{2T}\,\|g\|_{L^2}$. Since $\mathcal{F}^{-1}$ maps $L^1$ into bounded continuous functions with $\|\mathcal{F}^{-1}g\|_\infty \leq \|g\|_{L^1}$, back in the time domain we get uniform convergence.
Since the $\mbox{sinc}$ function and its shifts are smooth, one can push this further and conclude that a band-limited $L^2$ function agrees almost everywhere with a smooth function (the same argument applies to the termwise derivatives, since multiplying $\hat{f}$ by powers of $\xi$ preserves the compact support).
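To illustrate the uniform convergence just obtained, here is a numerical sketch with a concrete band-limited function of my own choosing ($f(t) = \mathrm{sinc}^2(t)$, whose spectrum is the triangle on $[-1,1]$, so $T = 1$ and the sampling rate is $2T = 2$):

```python
import numpy as np

# Sketch: the sampling series of a concrete band-limited function converges
# in sup norm. f(t) = sinc(t)^2 has Fourier transform supported in [-1, 1],
# so with T = 1 Shannon's formula reads f(t) = Σ_k f(k/2) sinc(2t - k).

t = np.linspace(-5.0, 5.0, 201)
f = np.sinc(t) ** 2                    # the target band-limited function

k = np.arange(-4000, 4001)
samples = np.sinc(k / 2.0) ** 2        # f(k/(2T)); decays like 1/k^2, so l^1
# partial sum of the sinc series, evaluated on the whole grid at once
series = (samples[None, :] * np.sinc(2.0 * t[:, None] - k[None, :])).sum(axis=1)

sup_error = np.max(np.abs(f - series))
assert sup_error < 1e-3                # uniform (sup-norm) agreement on the grid
```

The residual error here is pure truncation: the tail of the $\ell^1$ sample sequence beyond $|k| = 4000$ bounds it, exactly as in the uniform-convergence argument above.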
There's a nice proof of this in Hörmander's book "The Analysis of Linear Partial Differential Operators I" (Theorem 7.1.14). The main tool needed is a distributional version of Fubini's theorem (Theorem 5.1.1 in Hörmander's book). I'll summarize the main points:
Preliminary step 1: Fubini's theorem
If $u$ and $v$ are distributions on ${\mathbb R}^m$ and ${\mathbb R}^n$ respectively, then there is a unique product distribution $u \otimes v$ on ${\mathbb R}^{m+n}$ characterized by the condition
$$
(u \otimes v)(\varphi \otimes \psi) = u(\varphi) v(\psi)
$$
for all $\varphi \in {\mathcal D}({\mathbb R}^m)$ and $\psi \in {\mathcal D}({\mathbb R}^n)$, where we write $(\varphi \otimes \psi)(x,y) := \varphi(x) \psi(y)$. Moreover, $u \otimes v$ can be evaluated on an arbitrary test function $\varphi \in {\mathcal D}({\mathbb R}^{m+n})$ by
$$
(u \otimes v)(\varphi) = u(x \mapsto v(\varphi(x,\cdot))) = v(y \mapsto u(\varphi(\cdot,y))).
$$
(Note that there is a slightly nontrivial exercise required to show that both expressions on the right make sense, e.g. that $x \mapsto v(\varphi(x,\cdot))$ defines a smooth compactly supported function on ${\mathbb R}^m$. This rests mainly on the fact that by uniform continuity, $x \mapsto \varphi(x,\cdot)$ is a continuous map to the space of test functions, and other arguments of this sort.) If you write down this formula in the case where $u$ and $v$ are given by locally integrable functions, you'll find that it follows easily from the classical Fubini's theorem.
If you know that the product distribution is unique, then the formula follows by verifying directly that both expressions on the right-hand side define distributions that satisfy the defining property of a product distribution.

Uniqueness can be proved via mollification: if the product distribution were not unique, there would exist a nonzero distribution $w$ on ${\mathbb R}^{m+n}$ (the difference of two product distributions) such that $w(\varphi \otimes \psi) = 0$ for all $\varphi \in {\mathcal D}({\mathbb R}^m)$ and $\psi \in {\mathcal D}({\mathbb R}^n)$. Choose approximate identities, i.e. sequences of smooth functions $\rho_j : {\mathbb R}^m \to [0,\infty)$ and $\sigma_j : {\mathbb R}^n \to [0,\infty)$ with shrinking compact supports near $\{0\}$ that converge to $\delta$-functions in the space of distributions. The classical Fubini theorem implies that the sequence $\rho_j \otimes \sigma_j : {\mathbb R}^{m+n} \to [0,\infty)$ is also an approximate identity in the same sense, and it follows that the sequence of smooth functions $(\rho_j \otimes \sigma_j) * w$ converges to $w$ in the space of distributions. But those functions are all $0$ due to the defining property of $w$, hence $w = 0$, a contradiction.
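The mollification step can be illustrated numerically. Below is a sketch in which the role of the distribution is played by a plain continuous function $w(x) = \sin x$ and $\rho_\eps$ is a standard normalized bump of width $\eps$ (all choices mine); the sup-norm error of $\rho_\eps * w$ against $w$ shrinks as the bump narrows:

```python
import numpy as np

# Sketch of the approximate-identity step: convolving with a normalized bump
# of shrinking width reproduces a continuous function in the limit.

x = np.linspace(-np.pi, np.pi, 2001)
w = np.sin(x)

def mollify(eps):
    s = np.linspace(-eps, eps, 401)
    ds = s[1] - s[0]
    # smooth compactly supported bump; clip avoids division by zero at s = ±eps
    rho = np.exp(-1.0 / (1.0 - (s / eps) ** 2).clip(min=1e-12))
    rho /= rho.sum() * ds                                  # normalize to unit mass
    # (ρ_eps * w)(x) = ∫ ρ_eps(s) w(x - s) ds, evaluated on the grid
    return np.array([(rho * np.sin(xi - s)).sum() * ds for xi in x])

errors = [np.max(np.abs(mollify(e) - w)) for e in (0.5, 0.1, 0.02)]
assert errors[0] > errors[1] > errors[2]   # sup error shrinks with eps
```

For a symmetric bump the error scales like $\eps^2$, which is why each fivefold narrowing of the bump shrinks the error by roughly a factor of 25 here.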
Preliminary step 2: polynomial growth
Before we can view the function $g(\xi) := u(\chi(x) e^{-i x \xi})$ as a plausible candidate for the Fourier transform of $\chi u$, we need to know that it behaves reasonably enough at infinity to define a tempered distribution. As I indicated in my comments on the previous answer, $g$ is definitely not a Schwartz function in general, but one can show that it has polynomial growth. Perhaps the quickest way is to use standard properties of the Fourier transform and rewrite $g$ as
$$
g(\xi) = \left( ( {\mathcal F}\chi)^- * {\mathcal F}^*u\right)(-\xi),
$$
where I'm using the notation $f^-(x) := f(-x)$. As the convolution of a Schwartz function with a tempered distribution, this function is smooth and of polynomial growth, by standard mapping properties of such convolutions.
The main argument
As stated in the question, we need to prove that the relation
$$
\int_{{\mathbb R}^n} u(\chi(x) e^{-i x \xi}) \phi(\xi) \, d\xi = u\left( \int_{{\mathbb R}^n} \chi(x) e^{- i x \xi} \phi(\xi)\, d\xi \right)
$$
holds for every $u \in {\mathcal D}'({\mathbb R}^n)$, $\chi \in {\mathcal D}({\mathbb R}^n)$ and $\phi \in {\mathcal S}({\mathbb R}^n)$. By step 2, we already know that both sides give well-defined tempered distributions when regarded as functionals of $\phi$, so by density, it will suffice to assume $\phi \in {\mathcal D}({\mathbb R}^n)$. The key observation now is that by the theorem in step 1, both sides can be identified with
$(u \otimes 1)(f)$, where $1 \in {\mathcal D}'({\mathbb R}^n)$ is the distribution $1(\varphi) := \int_{{\mathbb R}^n} \varphi(x)\, dx$ and $f \in {\mathcal D}({\mathbb R}^{2n})$ is given by
$$
f(x,\xi) := \chi(x) \phi(\xi) e^{-i x \xi}.
$$
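In the spirit of the remark in step 1 that everything reduces to classical Fubini when $u$ is a locally integrable function, here is a numerical sketch of the identity for such a $u$. The specific choices ($u(x) = |x|$, a bump $\chi$, a Gaussian $\phi$, dimension $n = 1$) are mine:

```python
import numpy as np

# Sketch: the classical-Fubini instance of the identity above, with the
# distribution u given by the locally integrable function u(x) = |x|.
# Both iterated integrals of u(x) χ(x) e^{-i x ξ} φ(ξ) must agree; the
# distributional theorem extends exactly this computation.

x = np.linspace(-3.0, 3.0, 1201);  dx = x[1] - x[0]
xi = np.linspace(-6.0, 6.0, 1201); dxi = xi[1] - xi[0]

u = np.abs(x)                                  # locally integrable "distribution"
chi = np.exp(-1.0 / (1.0 - (x / 3.0) ** 2).clip(min=1e-12))  # bump, supp ⊂ [-3, 3]
phi = np.exp(-xi ** 2)                         # Schwartz test function

kernel = np.exp(-1j * np.outer(x, xi))         # e^{-i x ξ} on the grid

# left-hand side: integrate in x first (i.e. apply u), then in ξ
lhs = ((((u * chi) @ kernel) * dx) * phi).sum() * dxi
# right-hand side: integrate in ξ first, then apply u
rhs = (u * chi * (kernel @ phi) * dxi).sum() * dx

assert abs(lhs - rhs) < 1e-8
```

Both sides are the same double Riemann sum summed in different orders, so they agree to floating-point rounding; the content of the theorem is that this exchange survives when $u$ is a genuine distribution.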
Best Answer
$\newcommand{\comb}{\mathrm{III}} % Dirac comb$ $\newcommand{\FT}{\hat}$ $\newcommand{\eps}{\epsilon}$ $\newcommand{\Scal}{\mathcal{S}}$ $\newcommand{\braket}[1]{\left\langle #1 \right\rangle}$ $\newcommand{\ZZ}{\mathbf{Z}}$ $\newcommand{\invFT}{\check}$ $\newcommand{\sinc}{\operatorname{sinc}}$ $\newcommand{\norm}[1]{\left\| #1 \right\|}$ $\newcommand{\supp}{\operatorname{supp}}$ $\newcommand{\abs}[1]{\left| #1 \right|}$ I think I found an answer.
Let $\Scal$ denote the space of Schwartz functions, let $\Scal'$ denote the space of tempered distributions, let $D^n = [-\tfrac12, \tfrac12]^n$ denote the closed $n$-dimensional unit cube centered at the origin, let $\tau_x$ denote the translation map $f \mapsto f(\cdot - x)$ for a function defined on Euclidean space, and let $O_M$ denote the class of smooth functions whose derivatives of all orders have polynomial growth at infinity. We can show the following:

Claim. If $x \in L^1$, the sample sequence $\{x(k)\}_{k \in \ZZ^n}$ is in $l^1(\ZZ^n)$, and $\supp(\FT{x}) \subseteq D^n$, then $x = \sum_{k \in \ZZ^n} x(k)\, \tau_k \sinc$.
Proof: First suppose $X = \FT{x}$ is compactly supported on the closed cube $(1 - \mu)D^n$, where $\mu \in (0, 1)$. Choose $0 < \eps < \mu$ and let $\psi_\eps \in \Scal$ be any real symmetric window function that is equal to $1$ on $(1 - \eps)D^n$, is supported on $D^n$, and satisfies $0 \leq \psi_\eps \leq 1$. For any $\phi \in \Scal$ we have \begin{align*} \braket{X, \phi} &= \braket{X, \psi_\eps \cdot \phi} + \braket{X, (1 - \psi_\eps) \cdot \phi}\\ &= \braket{X, \psi_\eps \cdot \phi}\\ &= \braket{X, \sum_{k \in \ZZ^n} \tau_{k}( \psi_\eps \cdot \phi)} \quad \text{since only one of the terms in the sum has support intersecting $(1-\mu)D^n$}\\ &= \braket{X * \comb, \psi_\eps \cdot \phi}\\ &= \braket{\psi_\eps \cdot (X * \comb), \phi}\\ \implies X &= \psi_\eps \cdot (X * \comb). \end{align*} Taking the inverse Fourier transform gives the equality of distributions $$ x = \invFT{\psi_\eps} * (x \cdot \comb) = \sum_{k \in \ZZ^n} x(k) \tau_k \invFT{\psi_\eps}. $$

As $\eps \to 0$, $\psi_\eps$ approaches $\chi_{D^n}$ a.e. and hence in $L^1$ (by the dominated convergence theorem), so its inverse Fourier transform approaches $\sinc$ uniformly. Since the sample sequence $(x(k))_k$ is in $l^1$ by assumption, for all $\phi \in \Scal$ we get $$ \abs{\braket{\sum_{k \in \ZZ^n} x(k) \tau_k (\invFT{\psi_\eps} - \sinc), \phi}} \leq \sup \abs{\invFT{\psi_\eps} - \sinc} \cdot \norm{\phi}_1 \sum_{k \in \ZZ^n} |x(k)| \xrightarrow{\eps \to 0} 0. $$ Since $x = \sum_{k} x(k) \tau_k \invFT{\psi_\eps}$ holds for every $\eps$, letting $\eps \to 0$ gives $$ x = \sum_{k \in \ZZ^n} x(k) \tau_k \sinc. $$

Now suppose that $x \in L^1$, $\{x(k)\}_k \in l^1(\ZZ^n)$ and $\supp(X) = D^n$. Then $X$ is a continuous (since $x \in L^1$) compactly supported function, so it is in all $L^p$ spaces. Set $\psi_\eps$ as before, and let $X_\eps = \psi_\eps \cdot X$ and $x_\eps = \invFT{X_\eps}$. By the above arguments, $x_\eps = \sum_k x_\eps(k)\sinc(t-k)$ as distributions (with the sum on the right interpreted in the topology of $\Scal'$).
Note that the sequence $\{x_\eps(k)\}_k$ is in $l^2$, because it is the sequence of Fourier coefficients of $X_\eps \in L^2(D^n)$. Also $\{\sinc(\cdot - k)\}_k$ is an orthonormal set in $L^2$, so the sum on the right converges to a function in $L^2$. It also converges absolutely at every point, because $$ \begin{align*} \abs{\sum_k x_\eps(k)\sinc(t-k)} &\leq \sum_k |x_\eps(k)\sinc(t-k)| \\ &\leq \norm{x_\eps}_{l^2} \norm{\sinc(t - \cdot)}_{l^2} \quad \text{by Cauchy-Schwarz}\\ &< \infty. \end{align*} $$ Since the map $L^2 \to \Scal'$ defined by $f \mapsto (\phi \mapsto \int f \phi)$ is injective (see for example Theorem 8.15 in Folland's Real Analysis), we must have that $$ x_\eps = \sum_k x_\eps(k)\sinc(t-k) $$ in $L^2$, hence almost everywhere, and then pointwise since both sides are continuous.
Now set $\tilde x = \sum_k x(k) \sinc(t-k)$. The sum on the right converges uniformly (by the Weierstrass M-test, since $(x(k))_k \in l^1$ and $\sinc$ is bounded) and in $L^2$ (by the same argument as for $x_\eps$). Then $$ \begin{align*} |x - \tilde x| &\leq |x - x_\eps| + |x_\eps - \tilde x| \\ &\leq \norm{x - x_\eps}_\infty + \abs{\sum_k (x - x_\eps)(k) \sinc(t-k)}\\ &\leq \norm{x - x_\eps}_\infty + \norm{x - x_\eps}_{l^2} \norm{\sinc(t - \cdot)}_{l^2} \quad \text{by Cauchy-Schwarz.} \end{align*} $$ By construction $X_\eps \to X$ pointwise and hence in every $L^p$ by the dominated convergence theorem. In particular $X_\eps \to X$ in $L^1$, which implies $x_\eps \to x$ in $L^\infty$, so the first term tends to $0$. Further, $X_\eps \to X$ in $L^2(D^n)$ implies that $(x_\eps(k))_k \to (x(k))_k$ in $l^2$, so the second term can also be made arbitrarily small. Thus $x = \tilde x$.
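Finally, a numerical sketch of the statement just proved, in dimension $n = 1$ with a concrete spectrum of my own choosing: $X(\xi) = \cos(\pi\xi)$ on $D^1 = [-\tfrac12,\tfrac12]$, whose inverse Fourier transform $x(t) = \frac{2}{\pi}\frac{\cos(\pi t)}{1 - 4t^2}$ is in $L^1$ with $l^1$ samples.

```python
import numpy as np

# Sketch: x(t) = (2/π) cos(πt) / (1 - 4t²) has spectrum cos(πξ) supported on
# [-1/2, 1/2], x ∈ L¹, and samples x(k) decaying like 1/k² (so l¹). The claim
# is the unit-rate reconstruction x(t) = Σ_k x(k) sinc(t - k).

def x_exact(t):
    return (2.0 / np.pi) * np.cos(np.pi * t) / (1.0 - 4.0 * t ** 2)

t = np.linspace(-4.4, 4.4, 221)        # grid avoiding the removable points ±1/2
k = np.arange(-4000, 4001)             # x(±1/2) is finite, but 1-4t² = 0 there

series = (x_exact(k)[None, :] * np.sinc(t[:, None] - k[None, :])).sum(axis=1)
sup_error = np.max(np.abs(series - x_exact(t)))
assert sup_error < 1e-3                # truncation tail ≈ 1/(π·4000)
```

As in the proof, the $l^1$ decay of the samples is what controls the sup-norm error of the truncated series.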