A recommendation: just compute the PSRF separately for each scalar component
The original article by Gelman & Rubin [1], as well as the Bayesian Data Analysis textbook by Gelman et al. [2], recommends calculating the potential scale reduction factor (PSRF) separately for each scalar parameter of interest. Convergence is then deduced by requiring that all PSRFs are close to 1. It does not matter that your parameters are interpreted as random vectors: their components are scalars, and for each scalar component you can compute a PSRF.
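The per-component recipe is straightforward to script. Here is a minimal sketch in NumPy (the array shapes and the synthetic `draws` are my own illustration, not from the articles):

```python
import numpy as np

def scalar_psrf(chains):
    """Gelman-Rubin PSRF for one scalar parameter.

    chains: shape (m, n) -- m parallel chains of length n (post burn-in).
    """
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    V_hat = (n - 1) / n * W + (1 + 1 / m) * B / n
    return np.sqrt(V_hat / W)

# Vector-valued parameter: just loop over its scalar components.
rng = np.random.default_rng(0)
draws = rng.normal(size=(4, 1000, 3))   # m=4 chains, n=1000, 3 components
psrfs = [scalar_psrf(draws[:, :, j]) for j in range(draws.shape[-1])]
```

With well-mixed chains each PSRF should sit near 1; a component whose chains disagree in location will show a PSRF well above 1.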
Brooks & Gelman [3] have proposed a multivariate extension of the PSRF, which I review in the next section of this answer. However, to quote Gelman & Shirley [4]:
[...] these methods
may sometimes represent overkill: individual parameters can be well estimated even while
approximate convergence of simulations of a multivariate distribution can take a very long time.
Alternative: the multivariate extension by Brooks & Gelman
Brooks & Gelman [3] propose a multivariate extension of the PSRF, where indeed one computes the estimated covariance matrix (your step 4) as a weighted sum of the within-chain ($W$) and between-chain ($B$) covariance matrices (your step 3):
\begin{equation}
\hat{V} = \frac{n-1}{n}W + \left ( 1 + \frac{1}{m} \right )\frac{B}{n},
\end{equation}
where $n$ is the chain length. One then needs a scalar measure of the distance between the covariance matrices $\hat{V}$ and $W$. The authors propose
\begin{equation}
\hat{R} = \max_a \frac{a^T\hat{V}a}{a^TWa} = \frac{n-1}{n} + \left(\frac{m+1}{m}\right)\lambda_1,
\end{equation}
where $m$ is the number of chains; the equality is shown in the article, with $\lambda_1$ being the largest positive eigenvalue of $W^{-1}B/n$. The authors then argue that under convergence of the chains $\lambda_1 \rightarrow 0$, and thus for large $n$ this multivariate $\hat{R}$ should converge to values near 1.
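A sketch of this multivariate $\hat{R}$ in NumPy (the shapes and synthetic data are illustrative assumptions; I compute $\lambda_1$ as the largest eigenvalue of $W^{-1}B/n$, which vanishes under convergence):

```python
import numpy as np

def multivariate_psrf(chains):
    """Brooks & Gelman style multivariate PSRF.

    chains: shape (m, n, d) -- m chains, length n, d-dimensional parameter.
    """
    m, n, d = chains.shape
    # W: average of the per-chain sample covariance matrices
    W = np.mean([np.cov(c, rowvar=False) for c in chains], axis=0)
    # B/n: sample covariance of the m chain means
    B_over_n = np.cov(chains.mean(axis=1), rowvar=False)
    # lambda_1: largest eigenvalue of W^{-1} B/n (shrinks toward 0 as the
    # chain means agree, i.e. under convergence)
    lam1 = np.linalg.eigvals(np.linalg.solve(W, B_over_n)).real.max()
    return (n - 1) / n + (m + 1) / m * lam1

rng = np.random.default_rng(1)
ok = rng.normal(size=(4, 2000, 3))      # chains sampling the same target
R_ok = multivariate_psrf(ok)            # near 1 when chains agree
```

Giving each chain its own offset (so they explore different regions) pushes $\lambda_1$, and hence $\hat{R}$, far above 1.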
References
[1] Gelman, Andrew, and Donald B. Rubin. "Inference from iterative simulation using multiple sequences." Statistical Science (1992): 457-472.
[2] Gelman, Andrew, et al. Bayesian data analysis. CRC press, 2013.
[3] Brooks, Stephen P., and Andrew Gelman. "General methods for monitoring convergence of iterative simulations." Journal of Computational and Graphical Statistics 7.4 (1998): 434-455.
[4] Gelman, Andrew, and Kenneth Shirley. "Inference from simulations and monitoring convergence." Chapter 6 in Brooks, Steve, et al., eds. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.
All articles except the textbook [2] are available at Andrew Gelman's website.
Best Answer
Note that individual chains have serial dependence, while values from separate chains do not; so if you wanted the result to look like one long chain, simply concatenating them would not look right.
However, if you're only interested in the distribution, the order within a chain is irrelevant. You aren't really concatenating the chains in that case; you simply want to pool all the distributional information (treat them as one big sample). Certainly, if the chains have all converged to their stationary distribution, they are all samples from the same distribution -- you can combine those.
Indeed some people run a burn-in period and just draw a single value from many separate chains.
(Keeping the runs separate might help to judge if they actually have converged.)
If you're computing a variance that accounts for the dependence structure, however, you would base it on the fact that the different runs are independent while the values within the same run are dependent.
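To illustrate the distinction, a hypothetical sketch (synthetic draws standing in for real post-burn-in MCMC output): pool everything when summarizing the distribution, but lean on the across-chain independence when estimating Monte Carlo error:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 5000
# Stand-in for post-burn-in draws from m independent chains.
chains = rng.normal(loc=2.0, scale=1.0, size=(m, n))

# For the distribution itself, order is irrelevant: pool all draws.
pooled = chains.reshape(-1)
post_mean = pooled.mean()
ci = np.percentile(pooled, [2.5, 97.5])   # e.g. a 95% posterior interval

# For the Monte Carlo error of the mean, exploit independence *across*
# chains: each chain mean is one independent replicate of the estimator.
chain_means = chains.mean(axis=1)
mcse = chain_means.std(ddof=1) / np.sqrt(m)
```

Estimating the standard error from the chain means sidesteps the within-chain autocorrelation entirely, at the cost of using only $m$ independent replicates.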