Let's start by stripping out the Kalman filter terms, because this seems to be where you're initially confused.
Your dynamical system looks like this:
$$\begin{align*}
x_t &= F x_{t-1} + w_t, \\
z_t &= H x_t + v_t.
\end{align*}
$$
I've dropped the subscript $t$ from the matrices since we'll assume for now that those aren't going to change with time.
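To make the notation concrete, here is a minimal simulation sketch of that model. Everything in it -- the constant-velocity $F$, the position-only $H$, and the noise levels -- is a made-up toy example, not something from your system:

```python
import numpy as np

# Toy 2-state system: [position, velocity], observing position only.
rng = np.random.default_rng(0)
dt = 0.1
F = np.array([[1.0, dt],
              [0.0, 1.0]])   # constant-velocity dynamics (assumed)
H = np.array([[1.0, 0.0]])   # we can only measure position (assumed)
Q = 0.01 * np.eye(2)         # process noise covariance (made up)
R = np.array([[0.25]])       # sensor noise covariance (made up)

x = np.array([0.0, 1.0])
for t in range(100):
    w = rng.multivariate_normal(np.zeros(2), Q)  # process noise w_t
    v = rng.multivariate_normal(np.zeros(1), R)  # sensor noise v_t
    x = F @ x + w                                # x_t = F x_{t-1} + w_t
    z = H @ x + v                                # z_t = H x_t + v_t, fed to the filter
```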
Your system dynamics are described by $F$, which is sometimes called the "plant matrix" or state transition matrix. It tells you how the state vector evolves from one time step to the next, so it completely encodes your system dynamics.
The matrix $H$ is your observation matrix, sometimes called the measurement matrix. (Note this is not the same thing as the observability matrix from control theory.) You cannot necessarily directly observe every state. If you can, this term is just the identity matrix. But often what you can observe is a combination of other states. For instance, you might not be able to directly measure the RPM of your transmission output shaft, but you can measure the RPM of your engine and your wheels. In this case, your wheel RPM is your engine RPM stepped down (or up) by your transmission gear ratio. So while you cannot observe the shaft speed directly, you can certainly write it as a function of other measurable states.
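To make the drivetrain example concrete: suppose the state vector were $x_t = [\omega_{\text{engine}},\ \omega_{\text{shaft}},\ \omega_{\text{wheel}}]^T$ and you only had engine and wheel sensors. Then, purely as an illustration,

$$ z_t = \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{H} x_t + v_t, $$

and the unmeasured $\omega_{\text{shaft}}$ is still recoverable because the dynamics in $F$ tie it to the measured states (e.g. $\omega_{\text{shaft}} = \omega_{\text{engine}}/r$ for some gear ratio $r$).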
You should know your $F$ and $H$ matrices, in general, from first principles. In practice you don't always know them and have to estimate them, but that's a system identification problem, largely separate from Kalman filtering, so for now I'll assume you can derive them.
Kalman filtering comes in because of the noise terms, $w_t$ and $v_t$. There are unmeasurable process noises -- turbulence, manufacturing flaws, etc. -- which are described by $w_t$, and there are sensor noises -- line noise, sensor manufacturing flaws, etc. -- which are described by $v_t$.
The noise entering each state is not necessarily independent of the noise entering the others. For example, if you're flying an airplane, turbulence that affects your left wing will affect every state in your vehicle dynamics. So process noise is generally correlated across states rather than independent per state.
Your $Q$ and $R$ matrices capture exactly this: $Q$ is the covariance matrix of the process noise $w_t$, and $R$ is the covariance matrix of the sensor noise $v_t$, including any correlations between components. You can estimate them using any of several techniques; which one works best really depends on your application and your data. Certainly, if you have lots of real-world measurements, you can do some autocovariance computations and approximate these matrices.
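As one deliberately simple instance of such a technique: if you can log a sensor while the true state is held constant (say, the instrument sitting on a bench), the sample covariance of those readings is a reasonable first cut at $R$. A sketch, assuming you have such a log; more principled methods like autocovariance least squares exist:

```python
import numpy as np

def estimate_R(z_log):
    """Estimate R from measurements taken while the true state is constant.

    z_log: (num_samples, num_sensors) array of raw readings, so that all
    sample-to-sample variation is attributable to sensor noise.
    """
    residuals = z_log - z_log.mean(axis=0)  # remove the constant signal
    return residuals.T @ residuals / (len(z_log) - 1)  # sample covariance
```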
Finally, your matrix $P$ is what the Kalman filter is really all about. $P$ is the covariance of your state estimate; it measures the following quantity: "given the data $z$ observed so far, how sure are we about where the states actually are?" Recall that the fundamental problem is that not all states are directly observable, and even when they are, there is noise. But we have pretty good faith in the accuracy of what we can observe. So, given what we know about the system dynamics and what we observe, we can make a prediction about the state of the system, and $P$ tells us how much to trust that prediction.
The initial estimate of $P$ describes how well we really understand our initial conditions. Sometimes, we can pin them down exactly (for instance, measuring something on the ground). Other times, we're left with imperfect data (for instance, using radar data to estimate where a satellite is).
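To see how all of these pieces fit together, here is a minimal sketch of one predict/update cycle; the matrices themselves are whatever your application supplies:

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    # Predict: push the state estimate and its uncertainty through the dynamics.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q

    # Update: weigh the prediction against the new measurement z.
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

Note the gain line: it is exactly the quantity whose equivalence we establish next.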
Following the symbols on Wikipedia, the Kalman gain is $K_k = P_{k|k-1}H^T_k S_k^{-1}$, where $S_k = H_k P_{k|k-1} H_k^T + R_k$ is the innovation covariance.
$S_k^{-1}$ is equivalent to what you have called $\Sigma_t^{-1}$ in your question, so it suffices to show the equivalence of $P_{k|k-1}H_k^T$ and $\Sigma^{x,t}_z$.
The cross covariance between the predicted state $\hat{x}_{k|k-1}$ and the predicted measurement $H_k \hat{x}_{k|k-1} + v_k$ (which you have called $\Sigma^{x,t}_z$) is defined as
$$E[(\hat{x}_{k|k-1} - E[\hat{x}_{k|k-1}])(H_k\hat{x}_{k|k-1} + v_k - E[H_k\hat{x}_{k|k-1}+v_k])^T]$$
Now use linearity of expectation to expand $E[H_k\hat{x}_{k|k-1}+v_k] = H_k E[\hat{x}_{k|k-1}] + E[v_k]$ inside the second factor:
$$= E[(\hat{x}_{k|k-1} - E[\hat{x}_{k|k-1}])(H_k(\hat{x}_{k|k-1}- E[\hat{x}_{k|k-1}])+v_k - E[v_k])^T]. $$
Define $\Delta \hat{x}_{k|k-1} \equiv \hat{x}_{k|k-1} - E[\hat{x}_{k|k-1}]$ and $\Delta v_k \equiv v_k - E[v_k]$.
$$= E[\Delta \hat{x}_{k|k-1}(H_k\Delta \hat{x}_{k|k-1}+\Delta v_k)^T]. $$
Distribute the transpose.
$$= E[\Delta \hat{x}_{k|k-1}(\Delta \hat{x}_{k|k-1}^T H_k^T+\Delta v_k^T)]. $$
Distribute the product and use linearity of expectation, pulling the constant matrix $H_k^T$ out.
$$= E[\Delta \hat{x}_{k|k-1}\Delta \hat{x}_{k|k-1}^T] H_k^T + E[\Delta \hat{x}_{k|k-1} \Delta v_k^T]. $$
Now use independence of $v_k$ from the predicted state to factor the expectation of the second term, and note that $E[\Delta v_k] = E[v_k - E[v_k]] = 0$ by construction.
$$= E[\Delta \hat{x}_{k|k-1}\Delta \hat{x}_{k|k-1}^T] H_k^T + E[\Delta \hat{x}_{k|k-1}]E[\Delta v_k^T] $$
$$= E[\Delta \hat{x}_{k|k-1}\Delta \hat{x}_{k|k-1}^T] H_k^T + 0 $$
$$= P_{k|k-1} H_k^T, $$
where the last step uses the definition of the predicted state covariance, $P_{k|k-1} \equiv E[\Delta \hat{x}_{k|k-1}\Delta \hat{x}_{k|k-1}^T]$.
So in the end, the equivalence follows from the standard Kalman filter assumptions. The same simplification is not possible in the UKF case because the measurement model is nonlinear, so there is no matrix $H_k$ to factor out.
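If you want a numerical sanity check on the algebra, draw zero-mean samples with toy covariances and confirm that the sample cross covariance matches $P_{k|k-1} H_k^T$ (all values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
P_pred = np.array([[2.0, 0.3],
                   [0.3, 1.0]])   # toy predicted state covariance
H = np.array([[1.0, 0.5]])       # toy observation matrix
R = np.array([[0.4]])            # toy sensor noise covariance

n = 200_000
dx = rng.multivariate_normal(np.zeros(2), P_pred, size=n)  # samples of Δx
dv = rng.multivariate_normal(np.zeros(1), R, size=n)       # samples of Δv
dz = dx @ H.T + dv                                         # Δz = H Δx + Δv

cross_cov = dx.T @ dz / n                 # sample estimate of E[Δx Δz^T]
print(np.allclose(cross_cov, P_pred @ H.T, atol=0.02))     # ~ True
```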
You can, however, get a completely mechanical intuition about the information content of the cross covariance. It answers the following question: "If I wiggle state variable $i$ inside $\Delta \hat{x}_{k|k-1}$ upwards, how likely is it that observation variable $j$ of $(H_k \hat{x}_{k|k-1} + v_k)$ will also go up?" That is what entry $(i,j)$ of the cross covariance means, and it makes sense that this information is relevant for the operation of the UKF.
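Concretely, in the UKF you build that cross covariance directly from the sigma points: push each sigma point through the (possibly nonlinear) measurement function and accumulate the weighted outer products, no $H_k$ needed. A sketch, where `sigma_pts`, `weights`, and the measurement function `h` come from whatever UKF implementation you are using:

```python
import numpy as np

def cross_covariance(sigma_pts, weights, h, x_pred, z_pred):
    """sigma_pts: (n_pts, n_x); weights: (n_pts,); h maps state -> measurement."""
    P_xz = np.zeros((sigma_pts.shape[1], len(z_pred)))
    for X, w in zip(sigma_pts, weights):
        dx = X - x_pred                # sigma-point deviation in state space
        dz = h(X) - z_pred             # its deviation in measurement space
        P_xz += w * np.outer(dx, dz)   # weighted sum approximates E[Δx Δz^T]
    return P_xz
```

Entry $(i,j)$ of the result is exactly the "if state $i$ wiggles up, does measurement $j$ go up" number described above.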