Actually $s$ doesn't need to systematically underestimate $\sigma$; this could happen even if that weren't true.

As it is, $s$ is biased for $\sigma$ (the fact that $s^2$ is unbiased for $\sigma^2$ means that $s$ must be biased for $\sigma$, by Jensen's inequality*), but that's not the central thing going on here.

* *Jensen's inequality*

If $g$ is a convex function, $g\left(\text{E}[X]\right) \leq \text{E}\left[g(X)\right]$
with equality only if $X$ is constant or $g$ is linear.

Now $g(x)=-\sqrt{x}$ is convex,

so $-\sqrt{\text{E}[X]} < \text{E}(-\sqrt{X})$,
i.e.
$\sqrt{\text{E}[X]} > \text{E}(\sqrt{X})\,$; taking $X=s^2$, so that $\text{E}[X]=\sigma^2$, this gives $\sigma>E(s)$ whenever the random variable $s$ is not a fixed constant.

Edit: a simpler demonstration not invoking Jensen --

Assume that the distribution of the underlying variable has $\sigma>0$.

Note that $\text{Var}(s) = E(s^2)-E(s)^2$; this variance is strictly positive whenever $\sigma>0$ (since $s$ is then not constant).

Hence $E(s)^2 = E(s^2)-\text{Var}(s) < \sigma^2$, so $E(s)<\sigma$.
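A quick Monte Carlo check of this (a sketch only; the seed, replication count, and $\sigma=1$ are arbitrary choices):

```python
import random
from statistics import fmean, stdev

random.seed(1)  # arbitrary seed, for reproducibility
n, reps, sigma = 10, 100_000, 1.0

# Sample standard deviation s for many normal samples of size n
s_vals = [stdev(random.gauss(0.0, sigma) for _ in range(n))
          for _ in range(reps)]

print(fmean(s_vals))                 # noticeably below sigma = 1 (around 0.97)
print(fmean(s * s for s in s_vals))  # near sigma^2 = 1, since s^2 is unbiased
```

The gap between the two printed values is exactly the $\text{Var}(s)$ term in the identity above.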

*So what is the main issue?*

Let $Z=\frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}}$

Note that you're dealing with $t=Z\cdot\frac{\sigma}{s}$.

That inversion of $s$ is important. The effect on the variance comes not from whether $s$ is smaller than $\sigma$ on average (though it is, very slightly), but from whether $1/s$ is larger than $1/\sigma$ on average (and those two things are NOT the same thing).

And $1/s$ is larger than $1/\sigma$ on average, to a greater extent than $s$ is smaller than $\sigma$.

Which is to say $E(1/X)\neq 1/E(X)$; in fact, from Jensen's inequality:

$g(x) = 1/x$ is convex for $x>0$, so if $X$ is a positive random variable that is not constant,

$1/\left(\text{E}[X]\right) < \text{E}\left[1/X\right]$
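A toy illustration of that gap, with $X$ taking the (arbitrarily chosen) values 1 and 3 with equal probability:

```python
from statistics import fmean

xs = [1.0, 3.0]  # X uniform on {1, 3}, an arbitrary toy example

print(1 / fmean(xs))             # 1/E[X] = 1/2
print(fmean(1 / x for x in xs))  # E[1/X] = 2/3, strictly larger
```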

So consider, for example, normal samples of size 10; $s$ is about 2.7% smaller than $\sigma$ on average, but $1/s$ is about 9.4% *larger* than $1/\sigma$ on average. So even if at n=10 we made our estimate of $\sigma$ 2.7-something percent larger** so that $E(\widehat\sigma)=\sigma$, the corresponding $t=Z\cdot\frac{\sigma}{\widehat\sigma}$ would not have unit variance - it would still be a fair bit larger than 1.

**(at other $n$ the adjustment would be different of course)
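As a check on those figures: since $(n-1)s^2/\sigma^2$ has a chi-squared distribution with $n-1$ degrees of freedom for normal samples, the exact constants at $n=10$ can be computed from gamma functions (a sketch using only the standard library):

```python
from math import gamma, sqrt

n = 10
k = n - 1  # degrees of freedom: (n-1) s^2 / sigma^2 ~ chi-square with k df

# E(s)/sigma (the constant often called c4)
c4 = sqrt(2.0 / k) * gamma(n / 2) / gamma(k / 2)

# E(sigma/s), from the mean of 1/sqrt(chi-square with k df)
e_inv = sqrt(k / 2.0) * gamma((k - 1) / 2) / gamma(k / 2)

print(f"E(s)/sigma = {c4:.4f}")     # 0.9727 -> s about 2.7% smaller on average
print(f"E(sigma/s) = {e_inv:.4f}")  # 1.0942 -> 1/s about 9.4% larger on average

# For comparison, the variance of a t-distribution with k df is k/(k-2)
print(f"Var(t_{k})  = {k / (k - 2):.4f}")  # 9/7, well above 1
```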

The t-distribution is like the standard normal distribution but with a higher variance (smaller peak and fatter tails); if you adjust for the difference in spread, the peak is higher.

Why does the t-distribution become more normal as sample size increases?

The standard normal distribution vs the t-distribution

## Best Answer

Hint: