Remember, operators are nothing but maps. Expectation value of an operator is pretty much defined (I guess) in general operator theory. It just turns out that in QM (Hermitian) operators correspond to dynamical variables. In general you can also calculate expectation values of operators like $L_+$ and $a^{\dagger}$ etc., which don't have any dynamical variables associated !!
On a more general note, if you want to get a representation for the operator (given some basis, which could be the eigenset of an operator), the expectation values would be the diagonal elements of the representation.
It is related to Linear algebra, This might throw some light on Operator theory. Linear operators (bounded of course) are maps defined in LVS that take one vector to another vector in the same LVS.
$$ \hat O : V \rightarrow V$$
$$ \hat O \left|\psi\right> = \left|\phi\right> $$
Then you have the inner product defined on the LVS, that takes two vectors to one complex number :
$$ \left<\psi|\phi\right> : V^* \times V \rightarrow \mathbb{C} $$
Using these two, one can define the expectation value of an operator to be,
$$ <\hat O> = \left<\psi\right|\; \big(\hat O\left|\psi\right>\big) = \left<\psi|\phi\right> \in \mathbb{C} $$
I'm going to try to explain why and how density operators in quantum mechanics correspond to random variables in classical probability theory, something none of the other answers have even tried to do.
Let's work in a two-dimensional quantum space. We'll use standard physics bra-ket notation. A quantum state is a column vector in this space, and we'll represent a column vector as $\alpha|0\rangle + \beta |1 \rangle.$ A row vector is $\gamma \langle 0 | + \delta \langle 1 |\,$.
Now, you might think that a probability distribution is a measure on quantum states. You can think of it that way, but it turns out that this is too much information. For example, consider two probability distributions on quantum states. First, let's take the probability distribution
$$ \begin{array}{cc}
|0\rangle & \mathrm{with\ probability\ }2/3,\\
|1\rangle & \mathrm{with\ probability\ }1/3.
\end{array}
$$
Next, let's take the probability distribution
$$ \begin{array}{cc}
\sqrt{{2}/{3}}\,\left|0\right\rangle +\sqrt{1/3}\, \left|1\right\rangle & \mathrm{with\ probability\ }1/2,\\
\sqrt{{2}/{3}}\,\left|0\right\rangle -\sqrt{1/3}\, \left|1\right\rangle & \mathrm{with\ probability\ }1/2.
\end{array}
$$
It turns out that these two probability distributions are indistinguishable. That is, any measurement you make on one will give exactly the same probability distribution of results that you make on the other. The reason for that is that
$$
\frac{2}{3} |0\rangle\langle0| +\frac{1}{3}|1\rangle\langle 1|
$$
and
$$
\frac{1}{2}\left(\sqrt{2/3}\left|0\right\rangle +\sqrt{1/3}\, \left|1\right\rangle\right)
\left(\sqrt{2/3}\left\langle 0\right| +\sqrt{1/3}\, \left\langle 1\right|\right)
+\frac{1}{2}\left(\sqrt{{2}/{3}}\left|0\right\rangle -\sqrt{1/3}\, \left|1\right\rangle\right)
\left(\sqrt{2/3}\left\langle 0\right| -\sqrt{1/3}\, \left\langle 1\right|\right)
$$
are the same matrix.
That is, a probability distribution on quantum states is an overly specified distribution, and it is quite cumbersome to work with. We can predict any experimental outcome for a probability distribution on quantum states if we know the corresponding density operator, and many probability distributions yield the same density operator. If we have a probability density $\mu_v$ on quantum states $v$, we can predict any experimental outcome from the density operator
$$
\int v v^* d \mu_v \,.
$$
So for quantum probability theory, instead of working with probability distributions on quantum states, we work with density operators instead.
Classical states correspond to orthonormal vectors in Hilbert space, and classical probability distributions correspond to diagonal density operators.
Best Answer
Since you want a bit of mathematical rigor:
A quantum state is a self-adjoint positive trace class operator on a Hilbert space with trace 1. This is called density matrix $\rho$. In its simplest form, given $\psi\in \mathscr{H}$, $\rho$ is the orthogonal projector on the subspace spanned by $\psi$. Let $E_\rho(\cdot):D_\rho\subset\mathcal{A}(\mathscr{H})\to \mathbb{R}$ be the map defined as: $$E_\rho(A)=\mathrm{Tr}(A\rho)\; ,$$ where $\mathcal{A}(\mathscr{H})$ is the space of self-adjoint operators, $\mathrm{Tr}$ is the trace on $\mathscr{H}$ and $$D_\rho=\{A\in \mathcal{A}(\mathscr{H})\; ,\; \mathrm{Tr}\lvert A\rho\rvert<+\infty\}\; .$$ The map $E_\rho(\cdot)$ has all the properties of an expectation in probability theory. I don't know if it is possible to characterize the measure $\mu$ associated to it (maybe by means of the projection valued measures associated to $\rho$ by the spectral theorem, but it is not straightforward at least for me).