Intuitive solution: Imposing a restriction always costs flexibility relative to the unrestricted case. Hence the variance of the restricted estimator can be at most as high as the variance of the unrestricted estimator.
Mathematical: The matrix $X'X$ is a sum of squares and cross-products of the regressors and is closely related to the covariance matrix of the regressors in $X$ (it is not exactly the covariance matrix). Assuming $X$ has full column rank, this matrix is positive definite. Moreover, multiplying $(X'X)^{-1}$ by $\sigma^2$ gives the variance-covariance matrix of the least squares estimator, which is also positive definite because $\sigma^2$, the variance of the true residuals, is always positive.
Hence the difference between the variance-covariance matrices of the unrestricted and the restricted estimators is guaranteed to be positive semi-definite. For this reason the variance of the unrestricted estimator is greater than or equal to that of the restricted estimator.
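To make this concrete, here is a minimal simulation sketch (my own hypothetical data and variable names, not from the answer): we compare $\mathrm{Var}(\hat\beta_1)$ in the unrestricted model $y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ with the restricted model that imposes $\beta_2 = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
sigma2 = 1.0                        # variance of the true residuals

x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)  # deliberately correlated with x1
X = np.column_stack([x1, x2])

# Unrestricted: Var(beta_hat) = sigma^2 * (X'X)^{-1}
V_unrestricted = sigma2 * np.linalg.inv(X.T @ X)

# Restricted (beta_2 = 0): only x1 enters, so Var(beta1_hat) = sigma^2 / (x1'x1)
v_restricted = sigma2 / (x1 @ x1)

print("Var(beta1_hat), unrestricted:", V_unrestricted[0, 0])
print("Var(beta1_hat), restricted:  ", v_restricted)
```

The restricted variance is never the larger one: the difference of the two covariance matrices is positive semi-definite.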
Suppose we are working with a linear model $Y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal N(0, \sigma^2 I)$. Then (up to a constant) the log likelihood $l$ of $\beta$ is given by
$$
-2 \times l(\beta, \sigma^2 | Y) = \frac{1}{\sigma^2}|| Y - X\beta||^2.
$$
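To see where this comes from (a standard derivation, spelling out the "up to a constant"): the Gaussian log likelihood is
$$
l(\beta, \sigma^2 | Y) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}|| Y - X\beta||^2,
$$
so for fixed $\sigma^2$ the first term is an additive constant and $-2\, l$ reduces to the expression above.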
Recall the definition of the AIC:
$$
AIC(\hat \beta, \hat \sigma^2) = -2 l(\hat \beta, \hat \sigma^2 | y) + 2 p
$$
where $p$ is the dimension of our model.
We have that
$$
C_p = \frac{1}{\hat \sigma^2} ||Y - X\hat \beta||^2 + 2p - n
$$
so we can see that the AIC and $C_p$ differ only by a constant (here $AIC - C_p = n$, once $\hat \sigma^2$ is held fixed across candidate models), and therefore their respective argminima are the same. As @DJohnson mentioned in the comments, $C_p$ is only ever really used for variable selection, i.e. we care about its argmin rather than its actual values. This means that (for this particular model, at least) we can interpret argminima of $C_p$ in terms of the argminima of the AIC, and there's a whole body of work on that. See here or here, for instance.
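If you want to see the argmin agreement play out numerically, here is a small sketch (simulated data and variable names of my own choosing, with $\hat\sigma^2$ estimated once from the full model, as is standard for $C_p$):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p_full = 100, 4
X = rng.normal(size=(n, p_full))
beta = np.array([1.5, -2.0, 0.0, 0.0])   # only the first two regressors matter
y = X @ beta + rng.normal(size=n)

# sigma^2 estimated once from the full model
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_hat = np.sum((y - X @ beta_full) ** 2) / (n - p_full)

scores = {}
for k in range(1, p_full + 1):
    for subset in itertools.combinations(range(p_full), k):
        Xs = X[:, subset]
        rss = np.sum((y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]) ** 2)
        p = len(subset)
        aic_like = rss / sigma2_hat + 2 * p   # -2 log-likelihood + 2p, up to constants
        cp = rss / sigma2_hat + 2 * p - n     # Mallows' C_p as written above
        scores[subset] = (aic_like, cp)

best_aic = min(scores, key=lambda s: scores[s][0])
best_cp = min(scores, key=lambda s: scores[s][1])
print(best_aic, best_cp, best_aic == best_cp)   # same subset for both criteria
```

Since the two criteria differ by the constant $n$, the ranking of candidate subsets is identical by construction; the simulation just makes that visible.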
In effect, I'm completely echoing DJohnson's comment that this isn't a particularly useful statistic and there's no point in wasting time trying to understand it by itself. I advocate for framing it in terms of the AIC, which is definitely worth understanding (even if you don't like it or use it), and putting your mental effort there (and into related *IC things like the BIC, AICc, etc.).
Best Answer
Consider a simple regression without a constant term, where the single regressor is centered on its sample mean. Then $X'X$ is ($n$ times) its sample variance, and $(X'X)^{-1}$ is its reciprocal. So the higher the variance, i.e. variability, of the regressor, the lower the variance of the coefficient estimator: the more variability we have in the explanatory variable, the more accurately we can estimate the unknown coefficient.
Why? Because the more a regressor varies, the more information it contains. When there are many regressors, this generalizes to the inverse of their variance-covariance matrix, which also takes into account the co-variability of the regressors. In the extreme case where $X'X$ is diagonal, the precision of each estimated coefficient depends only on the variance/variability of the associated regressor (given the variance of the error term).
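Here is a quick sketch (hypothetical numbers of my own) of the single-regressor case: with a centered regressor and no constant, $\mathrm{Var}(\hat\beta) = \sigma^2 / (x'x)$, so more spread in $x$ means a more precise estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 500, 1.0

for spread in (0.5, 1.0, 4.0):
    x = rng.normal(scale=spread, size=n)
    x = x - x.mean()                 # center on the sample mean
    var_beta_hat = sigma2 / (x @ x)  # sigma^2 * (X'X)^{-1} in the scalar case
    print(f"sd(x) = {spread}: Var(beta_hat) = {var_beta_hat:.2e}")
```

As the spread of $x$ grows, $x'x$ grows and the estimator's variance shrinks accordingly.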