As it turns out you can use profile maximum likelihood to ensure the necessary properties. You can prove that for given $\hat\theta$, $l(\hat\theta,\Sigma)$ is maximised by
$$\hat\Sigma=\frac{1}{n}\sum_{i=1}^n\hat{\varepsilon}_i\hat{\varepsilon}_i',$$
where
$$\hat{\varepsilon}_i=y_i-f(x_i,\hat\theta)$$
Then it is possible to show that
$$\sum_{i=1}^n(y_i-f(x_i,\hat\theta))'\hat\Sigma^{-1}(y-f(x_i,\hat\theta)))=const,$$
hence we only need to maximise
$$l_R(\theta,\Sigma)=-\frac{n}{2} \log\det\hat\Sigma.$$
Naturally in this case $\Sigma$ will satisfy all the necessary properties. The proofs are identical for the case when $f$ is linear which can be found in Time Series Analysis by J. D. Hamilton page 295, hence I omitted them.
I will denote $\hat \theta$ the maximum likelihood estimator, while $\theta^{\left(m+1\right)}$ and $\theta^{\left(m\right)}$ are any two vectors. $\theta_0$ will denote the true value of the parameter vector. I am suppressing the appearance of the data.
The (untruncated) 2nd-order Taylor expansion of the log-likelihood viewed as a function of $\theta^{\left(m+1\right)}$, $\ell\left(\theta^{\left(m+1\right)}\right)$, centered at $\theta^{\left(m\right)}$ is (in a bit more compact notation than the one used by the OP)
$$\begin{align} \ell\left(\theta^{\left(m+1\right)}\right) =& \ell\left(\theta^{\left(m\right)}\right)+\frac{\partial\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right)\\
+&\frac{1}{2}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right)^{\prime}\frac{\partial^{2}\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta\partial\theta^{\prime}}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right)\\
+&R_2\left(\theta^{\left(m+1\right)}\right) \\\end{align}$$
The derivative of the log-likelihood is (using the properties of matrix differentiation)
$$\frac{\partial}{\partial \theta^{\left(m+1\right)}}\ell\left(\theta^{\left(m+1\right)}\right) = \frac{\partial\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta}
+\frac{\partial^{2}\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta\partial\theta^{\prime}}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right)
+\frac{\partial}{\partial \theta^{\left(m+1\right)}}R_2\left(\theta^{\left(m+1\right)}\right) $$
Assume that we require that
$$\frac{\partial}{\partial \theta^{\left(m+1\right)}}\ell\left(\theta^{\left(m+1\right)}\right)- \frac{\partial}{\partial \theta^{\left(m+1\right)}}R_2\left(\theta^{\left(m+1\right)}\right)=\mathbf 0$$
Then we obtain
$$\theta^{\left(m+1\right)}=\theta^{\left(m\right)}-\left[\mathbb{H}\left(\theta^{\left(m\right)}\right)\right]^{-1}\left[\mathbb{S}\left(\theta^{\left(m\right)}\right)\right]$$
This last formula shows how the value of the candidate $\theta$ vector is updated in each step of the algorithm. And we also see how the updating rule was obtained:$\theta^{\left(m+1\right)}$ must be chosen so as its total marginal effect on the log-likelihood equals its marginal effect on the Taylor remainder. In this way we "contain" how much the derivative of the log-likelihood strays away from the value zero.
If (and when) it so happens that $\theta^{\left(m\right)} = \hat \theta$ we will obtain
$$\theta^{\left(m+1\right)}=\hat \theta-\left[\mathbb{H}\left(\hat \theta\right)\right]^{-1}\left[\mathbb{S}\left(\hat \theta\right)\right]= \hat \theta-\left[\mathbb{H}\left(\hat \theta\right)\right]^{-1}\cdot \mathbf 0 = \hat \theta$$
since by construction $\hat \theta$ makes the gradient of the log-likelihood zero. This tells us that once we "hit" $\hat \theta$, we are not going anyplace else after that, which, in an intuitive way, validates our decision to essentially ignore the remainder, in order to calculate $\theta^{\left(m+1\right)}$. If the conditions for quadratic convergence of the algorithm are met, we have essentially a contraction mapping, and the MLE estimate is the (or a) fixed point of it. Note that if $\theta^{\left(m\right)} = \hat \theta$ then the remainder becomes also zero and then we have
$$\frac{\partial}{\partial \theta^{\left(m+1\right)}}\ell\left(\theta^{\left(m+1\right)}\right)- \frac{\partial}{\partial \theta^{\left(m+1\right)}}R_2\left(\theta^{\left(m+1\right)}\right)=\frac{\partial}{\partial \theta}\ell\left(\hat \theta\right)=\mathbf 0$$
So our method is internally consistent.
DISTRIBUTION OF $\hat \theta$
To obtain the asymptotic distribution of the MLE estimator we apply the Mean Value theorem according to which, if the log-likelihood is continuous and differentiable, then
$$\frac{\partial}{\partial \theta}\ell\left(\hat \theta\right) = \frac{\partial\ell\left(\theta_0\right)}{\partial\theta}
+\frac{\partial^{2}\ell\left(\bar \theta\right)}{\partial\theta\partial\theta^{\prime}}\left(\hat \theta-\theta_0\right) = \mathbf 0$$
where $\bar \theta$ is a mean value between $\hat \theta$ and $\theta_0$. Then
$$\left(\hat \theta-\theta_0\right) = -\left[\mathbb{H}\left(\bar \theta\right)\right]^{-1}\left[\mathbb{S}\left( \theta_0\right)\right]$$
$$\Rightarrow \sqrt n\left(\hat \theta-\theta_0\right) = -\left[\frac 1n\mathbb{H}\left(\bar \theta\right)\right]^{-1}\left[\frac 1{\sqrt n}\mathbb{S}\left( \theta_0\right)\right]$$
Under the appropriate assumptions, the MLE is a consistent estimator. Then so is $\bar \theta$, since it is sandwiched between the MLE and the true value. Under the assumption that our data is stationary, and one more technical condition (a local dominance condition that guarantees that the expected value of the supremum of the Hessian in a neighborhood of the true value is finite) we have
$$\frac 1n\mathbb{H}\left(\bar \theta\right) \rightarrow_p E\left[\mathbb{H}\left(\theta_0\right)\right]$$
Moreover, if interchange of integration and differentiation is valid (which usually will be), then
$$E\left[\mathbb{S}\left( \theta_0\right)\right]=\mathbf 0$$
This, together with the assumption that our data is i.i.d, permits us to use the Lindeberg-Levy CLT and conclude that
$$\left[\frac 1{\sqrt n}\mathbb{S}\left( \theta_0\right)\right] \rightarrow_d N(\mathbf 0, \Sigma),\qquad \Sigma = E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]$$
and then, by applying Slutzky's Theorem, that
$$\Rightarrow \sqrt n\left(\hat \theta-\theta_0\right) \rightarrow_d N\left(\mathbf 0, \operatorname{Avar}\right)$$
with
$$\operatorname{Avar} = \Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big)^{-1}\cdot \Big(E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]\Big)\cdot \Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big)^{-1}$$
But the information matrix equality states that
$$-\Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big) = \Big(E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]\Big)$$
and so
$$\operatorname{Avar} = -\Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big)^{-1} = \Big(E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]\Big)^{-1}$$
Then for large samples the distribution of $\hat \theta$ is approximated by
$$\hat \theta \sim _{approx} N\left(\theta_0, \frac 1n\operatorname {\widehat Avar}\right)$$
for a consistent estimator for $\operatorname {\widehat Avar}$ (the sample analogues of the expected values involved are such consistent estimators).
Best Answer
The EM algorithm is typically used to find MLEs of the parameters from an iid multivariate t sample. Instead of writting a long answer as to how to implement the algorithm I refer you to McLachlan and Krishnan "The EM algorithm and Extensions", second edition. They show the MLE procedure for both known and unknown degrees of freedom, as well as ways to accelerate the algorithm. If you don't have access to the book you can also look at this paper
https://arxiv.org/pdf/1707.01130.pdf
It turns out that estimating the degrees of freedom using ML results in an unbounded score function. The above paper finds estimators that minimize the MLq, which results in a bounded score. They also have a great review of the EM algorithm for the usual ML procedure.
Once you have estimators for both the degrees of freedom and the scale parameter, finding an estimate of the covariance is straight forward.