Actually, $s$ doesn't need to systematically underestimate $\sigma$ for this to happen; it would occur even if that weren't true.
As it is, $s$ is biased for $\sigma$ (the fact that $s^2$ is unbiased for $\sigma^2$ means that $s$ must be biased for $\sigma$, by Jensen's inequality*), but that's not the central thing going on here.
* Jensen's inequality
If $g$ is a convex function, $g\left(\text{E}[X]\right) \leq \text{E}\left[g(X)\right]$
with equality only if $X$ is constant or $g$ is linear.
Now $g(x)=-\sqrt{x}$ is convex,
so for non-constant $X$, $-\sqrt{\text{E}[X]} < \text{E}\left[-\sqrt{X}\right]$,
i.e.
$\sqrt{\text{E}[X]} > \text{E}\left[\sqrt{X}\right]\,$. Taking $X=s^2$ (so that $\text{E}[X]=\sigma^2$) gives $\sigma>\text{E}(s)$ whenever the random variable $s$ is not a fixed constant.
Edit: a simpler demonstration not invoking Jensen --
Assume that the distribution of the underlying variable has $\sigma>0$.
Note that $\text{Var}(s) = E(s^2)-E(s)^2$; this variance is strictly positive when $\sigma>0$ (since $s$ is then not constant).
Hence $E(s)^2 = E(s^2)-\text{Var}(s) = \sigma^2-\text{Var}(s) < \sigma^2$, so $E(s)<\sigma$.
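(If you'd rather see that numerically than take the algebra on trust, here's a quick simulation sketch - the normal parent, $\sigma=1$ and $n=10$ are purely illustrative choices:)

```python
import numpy as np

# Illustrative check that E(s) < sigma for normal samples (sigma = 1, n = 10 are arbitrary choices)
rng = np.random.default_rng(1)
sigma, n, reps = 1.0, 10, 200_000

samples = rng.normal(0.0, sigma, size=(reps, n))
s = samples.std(axis=1, ddof=1)   # sample standard deviation (Bessel-corrected)

print(np.mean(s**2))   # close to sigma^2 = 1  (s^2 is unbiased)
print(np.mean(s))      # noticeably below sigma = 1  (s is biased low)
```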
So what is the main issue?
Let $Z=\frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}}$
Note that you're dealing with $t=Z\cdot\frac{\sigma}{s}$.
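(Writing the algebra out, since this is the step everything else hangs on:)

$$t=\frac{\overline{X}-\mu}{s/\sqrt{n}}
=\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\cdot\frac{\sigma}{s}
=Z\cdot\frac{\sigma}{s}$$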
That inversion of $s$ is important. The effect on the variance is not about whether $s$ is smaller than $\sigma$ on average (though it is, very slightly), but about whether $1/s$ is larger than $1/\sigma$ on average (and those two things are NOT the same thing).
And $1/s$ does exceed $1/\sigma$ on average, by more than $s$ falls short of $\sigma$.
Which is to say $E(1/X)\neq 1/E(X)$; in fact, from Jensen's inequality:
$g(x) = 1/x$ is convex on $(0,\infty)$, so if a positive random variable $X$ is not constant,
$1/\text{E}[X] < \text{E}\left[1/X\right]$
So consider, for example, normal samples of size 10: $s$ is about 2.7% smaller than $\sigma$ on average, but $1/s$ is about 9.4% larger than $1/\sigma$ on average. So even if at $n=10$ we made our estimate of $\sigma$ about 2.7% larger** so that $E(\widehat\sigma)=\sigma$, the corresponding $t=Z\cdot\frac{\sigma}{\widehat\sigma}$ would not have unit variance - it would still be a fair bit larger than 1.
**(at other $n$ the adjustment would be different of course)
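(Where those percentages come from: for normal samples, $(n-1)s^2/\sigma^2\sim\chi^2_{n-1}$, which gives closed forms for $E(s)$ and $E(1/s)$ in terms of gamma functions. A small sketch reproducing the 2.7% and 9.4% figures - the function names here are just mine:)

```python
import numpy as np
from scipy.special import gammaln

# For normal samples, (n-1) s^2 / sigma^2 ~ chi^2_{n-1}.  From that:
#   E(s)/sigma     = sqrt(2/(n-1)) * Gamma(n/2)     / Gamma((n-1)/2)
#   sigma * E(1/s) = sqrt((n-1)/2) * Gamma((n-2)/2) / Gamma((n-1)/2)   (needs n > 2)

def mean_s_over_sigma(n):
    return np.sqrt(2.0 / (n - 1)) * np.exp(gammaln(n / 2) - gammaln((n - 1) / 2))

def mean_inv_s_times_sigma(n):
    return np.sqrt((n - 1) / 2.0) * np.exp(gammaln((n - 2) / 2) - gammaln((n - 1) / 2))

n = 10
print(mean_s_over_sigma(n))       # ~0.9727  ->  s is about 2.7% low on average
print(mean_inv_s_times_sigma(n))  # ~1.0942  ->  1/s is about 9.4% high on average
```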
The t-distribution is like the standard normal distribution but with a higher variance (a lower peak and fatter tails); if you adjust for the difference in spread, its peak is actually higher than the normal's.
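(To put rough numbers on that shape comparison - using df $=9$, matching the $n=10$ example above, as a purely illustrative choice:)

```python
import numpy as np
from scipy import stats

df = 9                               # e.g. n = 10 from the example above
t, z = stats.t(df), stats.norm()

print(t.pdf(0), z.pdf(0))            # ~0.388 vs ~0.399: lower peak for the t
print(t.sf(3), z.sf(3))              # ~0.0075 vs ~0.0013: much fatter tail for the t

# rescale the t to unit variance (its variance is df/(df-2)); now its peak is the higher one
c = np.sqrt(df / (df - 2))
print(c * t.pdf(0), z.pdf(0))        # ~0.44 vs ~0.399
```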
Why does the t-distribution become more normal as sample size increases?
The standard normal distribution vs the t-distribution
Just to clarify on relation to the title, we aren't using the t-distribution to estimate the mean (in the sense of a point estimate at least), but to construct an interval for it.
But why use an estimate when you can get your confidence interval exactly?
It's a good question (as long as we don't get too insistent on 'exactly', since the assumptions for it to be exactly t-distributed won't actually hold).
"You must use the t-distribution table when working problems when the population standard deviation (σ) is not known and the sample size is small (n<30)"
Why don't people use the T-distribution all the time when the population standard deviation is not known (even when n>30)?
I regard the advice as - at best - potentially misleading. In some situations, the t-distribution should still be used when degrees of freedom are a good deal larger than that.
Where the normal becomes a reasonable approximation depends on a variety of things (and so depends on the situation). However, since (with computers) it's not at all difficult to just use the $t$, even if the d.f. are very large, you'd have to wonder why you'd need to worry about doing something different at n=30.
If the sample sizes are really large, it won't make a noticeable difference to a confidence interval, but I don't think n=30 is always sufficiently close to 'really large'.
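(If you want a feel for how much it actually matters, compare the 97.5% critical values, which are all that changes in the usual interval - the particular d.f. values below are arbitrary:)

```python
from scipy import stats

# two-sided 95% critical values: t vs normal, at a few degrees of freedom
for df in (5, 10, 29, 100, 1000):
    print(df, round(stats.t.ppf(0.975, df), 3), round(stats.norm.ppf(0.975), 3))
# df=29 gives ~2.045 vs 1.960 - an interval roughly 4% wider, not always negligible
```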
There is one circumstance in which it might make sense to use the normal rather than the $t$ - that's when your data clearly don't satisfy the conditions to get a t-distribution, but you can still argue for approximate normality of the mean (if $n$ is quite large). However, in those circumstances, often the t is a good approximation in practice, and may be somewhat 'safer'. [In a situation like that, I might be inclined to investigate via simulation.]
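(A sketch of the sort of simulation I have in mind - the exponential parent, $n=30$ and the number of replications are arbitrary choices for illustration only:)

```python
import numpy as np
from scipy import stats

# Compare coverage of nominal 95% intervals using t vs normal critical values,
# when the parent distribution is skewed (exponential here, mean = 1).
rng = np.random.default_rng(7)
n, reps, mu = 30, 100_000, 1.0

x = rng.exponential(mu, size=(reps, n))
xbar = x.mean(axis=1)
se = x.std(axis=1, ddof=1) / np.sqrt(n)

for name, crit in [("t", stats.t.ppf(0.975, n - 1)), ("z", stats.norm.ppf(0.975))]:
    covered = np.abs(xbar - mu) <= crit * se
    print(name, covered.mean())   # both fall short of 0.95; the t interval is slightly wider/safer
```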
Best Answer
Normal distributions with very different standard deviations can have the same mean, so knowing the mean doesn't tell you which standard deviation you had. Indeed, for samples from a normal distribution, the sample mean and sample standard deviation are independent, so the mean doesn't tell you anything about the standard deviation.
If the variable is bounded (as you describe), then you cannot have a normal distribution (normal distributions are necessarily unbounded). On the other hand, the mean and the two bounds together do impose an upper limit on the standard deviation, but it's a pretty weak one.
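(For reference, the limit in question is the Bhatia-Davis inequality: for a variable supported on $[a,b]$ with mean $\mu$,

$$\text{Var}(X) \le (b-\mu)(\mu-a)\,,\qquad \text{so}\qquad \sigma \le \sqrt{(b-\mu)(\mu-a)}\,,$$

with equality only for a two-point distribution on $\{a,b\}$ - which is why it's usually such a weak bound in practice.)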