But I'm not convinced we should always apply this rule any time we cut
into $2n$ intervals. Why not just throw out the uneven weighting and
use a few more sample points? If the weighting is so helpful, why not
use a more complicated weighting (like the various n-point rules
(Newton-Cotes formulas) ... )?
The problem with high-order Newton-Cotes methods is that they inherit the same sort of problems you see with high-order interpolating polynomials. Remember that the Newton-Cotes quadrature rules are based on integrating interpolating polynomial approximations to your function over equally spaced points.
In particular, there is the Runge phenomenon: high-order interpolating polynomials are in general quite oscillatory. This oscillation manifests itself in the weights of the Newton-Cotes rules: the weights of the Newton-Cotes quadrature rules for 2 to 8 points and for 10 points (Simpson's is the three-point rule) are all positive, but in every other case negative weights are present. The reason for insisting on weights of the same sign in a quadrature rule is the phenomenon of subtractive cancellation, where two nearly equal quantities are subtracted, giving a result with fewer significant digits. By ensuring that all the weights have the same sign, any cancellation that occurs in the computation is due to the function being integrated (e.g. the function has a simple zero within the integration interval) and not due to the quadrature rule itself.
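One can see this numerically by solving for the closed Newton-Cotes weights from the moment equations $\sum_k w_k x_k^j = \int_{-1}^1 x^j\,dx$ and checking their signs. A small sketch (the `nc_weights` helper is my own, not a library routine):

```python
import numpy as np

def nc_weights(n):
    """Weights of the closed n-point Newton-Cotes rule on [-1, 1].

    Found by requiring exactness on 1, x, ..., x^(n-1): a Vandermonde solve.
    """
    x = np.linspace(-1.0, 1.0, n)
    V = np.vander(x, n, increasing=True).T   # V[j, k] = x_k ** j
    moments = np.array([(1.0 - (-1.0) ** (j + 1)) / (j + 1) for j in range(n)])
    return np.linalg.solve(V, moments)

for n in range(2, 12):
    w = nc_weights(n)
    print(n, "all positive" if (w > 0).all() else "has negative weights")
# negative weights first appear at n = 9, then again from n = 11 onward
```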
The approach of breaking up a function into smaller intervals and applying a low-order quadrature rule like Simpson's is effectively the integration of a piecewise polynomial approximation. Since piecewise polynomials are known to have better approximation properties than a single high-degree interpolating polynomial, this good behavior is inherited by the quadrature method.
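A minimal composite Simpson implementation (the function name is mine) makes the piecewise idea concrete: the familiar $1,4,2,4,\dotsc,2,4,1$ weight pattern is just the sum of overlapping $1,4,1$ panels.

```python
import numpy as np

def composite_simpson(f, a, b, n):
    """Composite Simpson's rule with n subintervals (n must be even)."""
    if n % 2:
        raise ValueError("n must be even")
    x = np.linspace(a, b, n + 1)
    h = (b - a) / n
    fx = f(x)
    # Weight pattern 1, 4, 2, 4, ..., 2, 4, 1 comes from summing 1,4,1 panels.
    return (h / 3) * (fx[0] + fx[-1] + 4 * fx[1:-1:2].sum() + 2 * fx[2:-1:2].sum())

approx = composite_simpson(np.sin, 0.0, np.pi, 100)
print(approx)   # close to the exact value 2
```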
On the other hand, one can still salvage the interpolating polynomial approach if one no longer insists on having equally-spaced sample points. This gives rise to e.g. Gaussian and Clenshaw-Curtis quadrature rules, where the sample points are taken to be the roots of Legendre polynomials in the former, and roots (or extrema in some implementations) of Chebyshev polynomials in the latter. (Discussing these would make this answer too long, so I shall say no more about them, except that these quadrature rules tend to be more accurate than the corresponding Newton-Cotes rule for the same number of function evaluations.)
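For a taste of the accuracy gap, NumPy ships Gauss-Legendre nodes and weights. On $\int_{-1}^{1} e^x\,dx$, five Gauss points give roughly ten correct digits, while the five-point Newton-Cotes rule (Boole's rule) manages only about five, from the same number of function evaluations:

```python
import numpy as np

exact = np.e - 1.0 / np.e                 # ∫_{-1}^{1} e^x dx

# 5-point Gauss-Legendre rule on [-1, 1]
xg, wg = np.polynomial.legendre.leggauss(5)
gauss = (wg * np.exp(xg)).sum()

# 5-point closed Newton-Cotes (Boole's rule) on the same interval
xb = np.linspace(-1.0, 1.0, 5)
boole = (2.0 / 90.0) * np.dot([7, 32, 12, 32, 7], np.exp(xb))

print(abs(gauss - exact))   # on the order of 1e-9
print(abs(boole - exact))   # on the order of 1e-4: several digits worse
```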
...is Simpson's rule so useful that calculus students should always use it for approximations?
As with any tool, blind use can lead you into a heap of trouble. In particular, a polynomial can never have horizontal asymptotes or vertical tangents, so it stands to reason that a polynomial will be a poor approximation to a function with these features, and a quadrature rule based on interpolating polynomials will then also behave poorly. The piecewise approach helps a bit, but not much. One should always consider a (clever?) change of variables to eliminate such features before applying a quadrature rule.
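A toy illustration of such a substitution (helper names are mine): $\int_0^1 \sqrt{x}\,dx$ has a vertical tangent at $0$, and composite Simpson converges sluggishly there; substituting $x = t^2$ turns it into $\int_0^1 2t^2\,dt$, a polynomial that Simpson integrates exactly.

```python
import numpy as np

def composite_simpson(f, a, b, n):
    """Composite Simpson's rule with an even number n of subintervals."""
    x = np.linspace(a, b, n + 1)
    h = (b - a) / n
    fx = f(x)
    return (h / 3) * (fx[0] + fx[-1] + 4 * fx[1:-1:2].sum() + 2 * fx[2:-1:2].sum())

exact = 2.0 / 3.0                                   # ∫_0^1 sqrt(x) dx

direct = composite_simpson(np.sqrt, 0.0, 1.0, 10)   # vertical tangent at 0
transformed = composite_simpson(lambda t: 2 * t**2, 0.0, 1.0, 10)  # x = t^2

print(abs(direct - exact))        # ~1e-3: the singularity in f' dominates
print(abs(transformed - exact))   # roundoff only: Simpson is exact for quadratics
```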
Let $f:[a,b]\to\mathbb R$ and $x_i = a + ih$, $f_i = f(x_i)$, for $i= 1, \dotsc, N$, where $h = (b-a) / (N+1)$ (so that $x_0 = a$ and $x_{N+1} = b$ are not sample points).
We want to approximate $\int_a^b f$ using a quadrature rule of at least order $4$. Thus, we apply Milne's rule for $x_1,x_2,x_3$ and $x_{N-2},x_{N-1}, x_N$, and Simpson's rule for $x_4,\dotsc, x_{N-3}$.
Assume $N$ is odd and $N \ge 9$, so that the middle region $[x_4, x_{N-3}]$ splits into $\frac{N-7}{2}$ Simpson panels of width $2h$.
We obtain:
\begin{align}
\int_a^{x_4} f & \approx \frac{4h}{3} (2f_1 - f_2 + 2f_3), \\
\int_{x_{2i+2}}^{x_{2i+4}} f & \approx \frac{h}{3} (f_{2i+2} + 4f_{2i+3} + f_{2i+4}), \quad i=1,\dotsc, \frac{N-7}{2}, \\
\int_{x_{N-3}}^b f & \approx \frac{4h}{3} (2f_{N-2} - f_{N-1} + 2f_{N}).
\end{align}
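This end-corrected scheme can be sketched as follows (assuming $N$ odd and $N \ge 9$, so the middle region tiles into Simpson panels of width $2h$; the function name is mine). Since both Milne's open rule and Simpson's rule are exact for cubics, the composite rule should reproduce $\int_0^1 x^3\,dx = 1/4$ to roundoff:

```python
import numpy as np

def milne_simpson(f, a, b, N):
    """Milne's open rule near the endpoints, Simpson panels in the middle.

    Uses only the interior nodes x_i = a + i*h, i = 1..N, h = (b-a)/(N+1);
    requires N odd and N >= 9.
    """
    if N % 2 == 0 or N < 9:
        raise ValueError("need N odd and N >= 9")
    h = (b - a) / (N + 1)
    x = a + h * np.arange(N + 2)          # x_0 = a, ..., x_{N+1} = b
    fx = f(x)
    # Milne (open three-point) over [a, x_4] and [x_{N-3}, b]
    total = (4 * h / 3) * (2 * fx[1] - fx[2] + 2 * fx[3])
    total += (4 * h / 3) * (2 * fx[N - 2] - fx[N - 1] + 2 * fx[N])
    # Simpson panels of width 2h tiling [x_4, x_{N-3}]
    for j in range(4, N - 3, 2):
        total += (h / 3) * (fx[j] + 4 * fx[j + 1] + fx[j + 2])
    return total

print(milne_simpson(lambda x: x**3, 0.0, 1.0, 11))   # 0.25 up to roundoff
```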
Best Answer
Newton-Cotes says "pick evenly spaced points in the interval, draw the interpolating polynomial of minimal degree through them, and integrate the polynomial." From the Lagrange interpolation theorem, a Newton-Cotes rule on $n$ nodes is exact for polynomials of degree at most $n-1$. In general it is not exact for higher degrees: one can add a multiple of a degree-$n$ polynomial which vanishes at the $n$ nodes, changing the integral without changing what the rule "sees". (The exception is when that degree-$n$ polynomial itself integrates to zero, as happens for symmetric rules with an odd number of nodes; this is why Simpson's rule is exact for cubics, one degree better than expected.)
Gaussian quadrature says "find a rule that lets us exactly integrate polynomials of as large a degree as possible." This winds up being degree $2n-1$, since a rule on $n$ nodes gives us control of $2n$ independent parameters: the nodes and the weights.
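One can verify this degree barrier directly: the three-point Gauss rule integrates $x^4$ (degree $4 \le 2\cdot 3-1$) exactly, but misses on $x^6$.

```python
import numpy as np

x, w = np.polynomial.legendre.leggauss(3)   # 3-point Gauss rule on [-1, 1]

# Exact for degree <= 2*3 - 1 = 5 ...
print((w * x**4).sum(), 2 / 5)    # both 0.4
# ... but not for degree 6
print((w * x**6).sum(), 2 / 7)    # 0.24 versus ~0.2857
```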
Analysis of either method is generally inspired by the Weierstrass approximation theorem, which generates the following type of argument. Here $I$ denotes exact integration, $Q_n$ denotes quadrature integration with a rule on $n$ nodes, and $p$ is a polynomial.
$$| I(f) - Q_n(f) | \leq | I(f) - I(p) | + | I(p) - Q_n(p) | + | Q_n(p) - Q_n(f) | $$
The Weierstrass theorem and the fact that $I$ is a bounded linear functional (in particular, $\| I \| = I(1)$) allow us to make the first term small, possibly at the expense of $p$ having a large degree.
Once the degree of $p$ is fixed, choosing $n$ large enough enables us to make the second term zero, since both Newton-Cotes and Gaussian quadrature can be made exact for polynomials by taking enough nodes.
The problem for the numerical analyst is to control the last term, when $\| p - f \|$ is small and $n$ might be arbitrarily large.
Now $Q_n$, like $I$, is linear, so we can write
$$Q_n(p) - Q_n(f) = Q_n(p-f).$$
Also, $Q_n$ is bounded:
$$|Q_n(f)| \leq \| f \| \sum_{k=1}^n |w_k|,$$
where $w_k$ are the weights. Hence
$$\| Q_n \| \leq \sum_{k=1}^n |w_k|.$$
This inequality is actually an equality, as we can check by applying $Q_n$ to a continuous $f$ with $\| f \|=1$ and $f(x_k)=\operatorname{sign}(w_k)$.
The fact that the $Q_n$ are each bounded is not good enough. Since simultaneously controlling the first two terms takes away our control over $n$, we need $Q_n$ to be uniformly bounded.
It turns out that Newton-Cotes rules do not have $Q_n$ uniformly bounded, whereas Gaussian quadrature does. The idea is to recognize that, because these methods are exact for polynomials of degree $0$, $\sum_{k=1}^n w_k = I(1)$. If the $w_k$ are all positive, then $\| Q_n \| = \sum_{k=1}^n |w_k| = I(1)$, which is independent of $n$, as we want.
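Numerically, $\sum_k |w_k|$ tells the story (the `nc_weights` moment-equation solve below is my own sketch): for Newton-Cotes it grows once negative weights appear, while for Gauss-Legendre it stays pinned at $I(1) = 2$ on $[-1,1]$.

```python
import numpy as np

def nc_weights(n):
    """Closed n-point Newton-Cotes weights on [-1, 1] via the moment equations."""
    x = np.linspace(-1.0, 1.0, n)
    V = np.vander(x, n, increasing=True).T
    m = np.array([(1.0 - (-1.0) ** (j + 1)) / (j + 1) for j in range(n)])
    return np.linalg.solve(V, m)

for n in (5, 11, 15):
    w_nc = nc_weights(n)
    _, w_g = np.polynomial.legendre.leggauss(n)
    print(n, np.abs(w_nc).sum(), np.abs(w_g).sum())
# Newton-Cotes: 2.0 at n = 5, then growing; Gauss: always 2.0
```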
One can prove from the property of being exact for polynomials of degree $2n-1$ that all of the weights in Gaussian quadrature are positive. (The trick is to integrate $\prod_{k=1,k \neq j}^n (x-x_k)^2$ for each $j$.) Consequently Gaussian quadrature is convergent. In particular, if there exists $p$ such that $\| p - f \| < \varepsilon/(2I(1))$ and the degree of $p$ is at most $m$, then $|I(f)-Q_n(f)|<\varepsilon$ provided $n \geq (m+1)/2$.
The weights in Newton-Cotes quadrature are not all positive once there are enough nodes. Moreover, $\| Q_n \|$ is unbounded for Newton-Cotes rules. This means that if $f=p+g$, where $p$ is a polynomial of degree at most $n-1$ and $g$ is a continuous function with $g(x_k)=\| g \| \operatorname{sign}(w_k)$, then $|Q_n(p-f)|=\| Q_n \| \| g \| \gg \| g \|$.
Consequently Newton-Cotes rules are not convergent for all continuous $f$. Very "nice" functions $f$ will converge under Newton-Cotes rules, but Runge's well-known example with $1/(x^2+1)$ on $[-5,5]$ shows that the number of derivatives is not what "nice" means. (The answer turns out to be that the Fourier coefficients must decay sufficiently fast.)
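A numerical sketch of this failure, using Runge's example rescaled to $[-1,1]$, i.e. $f(x)=1/(1+25x^2)$ (the interpolation-based `nc_integral` helper is mine; it integrates the equispaced interpolant, which is exactly what a Newton-Cotes rule does): the Gauss errors shrink as $n$ grows, while the Newton-Cotes errors do not.

```python
import numpy as np
from numpy.polynomial import chebyshev as Ch
from numpy.polynomial import legendre as L

f = lambda x: 1.0 / (1.0 + 25.0 * x**2)
exact = (2.0 / 5.0) * np.arctan(5.0)        # ∫_{-1}^{1} f = (2/5) arctan 5

def nc_integral(n):
    """Integrate the degree-(n-1) interpolant through n equispaced points."""
    x = np.linspace(-1.0, 1.0, n)
    c = Ch.chebfit(x, f(x), n - 1)          # interpolant, Chebyshev basis
    ci = Ch.chebint(c)
    return Ch.chebval(1.0, ci) - Ch.chebval(-1.0, ci)

def gauss_integral(n):
    x, w = L.leggauss(n)
    return (w * f(x)).sum()

for n in (10, 20, 30):
    print(n, abs(nc_integral(n) - exact), abs(gauss_integral(n) - exact))
```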