I will denote by $\hat \theta$ the maximum likelihood estimator, while $\theta^{\left(m+1\right)}$ and $\theta^{\left(m\right)}$ are any two parameter vectors; $\theta_0$ denotes the true value of the parameter vector. I suppress the dependence on the data throughout.
The (untruncated) 2nd-order Taylor expansion of the log-likelihood viewed as a function of $\theta^{\left(m+1\right)}$, $\ell\left(\theta^{\left(m+1\right)}\right)$, centered at $\theta^{\left(m\right)}$ is (in a bit more compact notation than the one used by the OP)
$$\begin{align} \ell\left(\theta^{\left(m+1\right)}\right) =& \ell\left(\theta^{\left(m\right)}\right)+\frac{\partial\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right)\\
+&\frac{1}{2}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right)^{\prime}\frac{\partial^{2}\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta\partial\theta^{\prime}}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right)\\
+&R_2\left(\theta^{\left(m+1\right)}\right) \\\end{align}$$
The derivative of the log-likelihood is (using the properties of matrix differentiation)
$$\frac{\partial}{\partial \theta^{\left(m+1\right)}}\ell\left(\theta^{\left(m+1\right)}\right) = \frac{\partial\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta}
+\frac{\partial^{2}\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta\partial\theta^{\prime}}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right)
+\frac{\partial}{\partial \theta^{\left(m+1\right)}}R_2\left(\theta^{\left(m+1\right)}\right) $$
Assume that we require that
$$\frac{\partial}{\partial \theta^{\left(m+1\right)}}\ell\left(\theta^{\left(m+1\right)}\right)- \frac{\partial}{\partial \theta^{\left(m+1\right)}}R_2\left(\theta^{\left(m+1\right)}\right)=\mathbf 0$$
Then we obtain
$$\theta^{\left(m+1\right)}=\theta^{\left(m\right)}-\left[\mathbb{H}\left(\theta^{\left(m\right)}\right)\right]^{-1}\left[\mathbb{S}\left(\theta^{\left(m\right)}\right)\right]$$
where $\mathbb{S}\left(\cdot\right)$ denotes the gradient (score) and $\mathbb{H}\left(\cdot\right)$ the Hessian of the log-likelihood. This last formula shows how the value of the candidate $\theta$ vector is updated in each step of the algorithm. And we also see how the updating rule was obtained: $\theta^{\left(m+1\right)}$ must be chosen so that its total marginal effect on the log-likelihood equals its marginal effect on the Taylor remainder. In this way we "contain" how much the derivative of the log-likelihood strays away from the value zero.
If (and when) it so happens that $\theta^{\left(m\right)} = \hat \theta$ we will obtain
$$\theta^{\left(m+1\right)}=\hat \theta-\left[\mathbb{H}\left(\hat \theta\right)\right]^{-1}\left[\mathbb{S}\left(\hat \theta\right)\right]= \hat \theta-\left[\mathbb{H}\left(\hat \theta\right)\right]^{-1}\cdot \mathbf 0 = \hat \theta$$
since by construction $\hat \theta$ makes the gradient of the log-likelihood zero. This tells us that once we "hit" $\hat \theta$, we are not going anyplace else after that, which, in an intuitive way, validates our decision to essentially ignore the remainder in order to calculate $\theta^{\left(m+1\right)}$. If the conditions for quadratic convergence of the algorithm are met, we essentially have a contraction mapping, and the MLE is the (or a) fixed point of it. Note that if $\theta^{\left(m\right)} = \hat \theta$ then the remainder (and its derivative) also becomes zero, and then we have
$$\frac{\partial}{\partial \theta^{\left(m+1\right)}}\ell\left(\theta^{\left(m+1\right)}\right)- \frac{\partial}{\partial \theta^{\left(m+1\right)}}R_2\left(\theta^{\left(m+1\right)}\right)=\frac{\partial}{\partial \theta}\ell\left(\hat \theta\right)=\mathbf 0$$
So our method is internally consistent.
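To make the updating rule concrete, here is a minimal numerical sketch of the iteration $\theta^{\left(m+1\right)}=\theta^{\left(m\right)}-\left[\mathbb{H}\left(\theta^{\left(m\right)}\right)\right]^{-1}\mathbb{S}\left(\theta^{\left(m\right)}\right)$, applied to a logistic-regression log-likelihood. The model, data, and function names here are my own illustrative choices, not something taken from the question:

```python
# Newton-Raphson iteration theta_{m+1} = theta_m - H^{-1} S for an MLE,
# illustrated on a logistic-regression log-likelihood (illustrative example).
import numpy as np

def score_and_hessian(theta, X, y):
    """Score (gradient) and Hessian of the Bernoulli/logit log-likelihood."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))          # fitted probabilities
    S = X.T @ (y - p)                              # score vector
    H = -(X * (p * (1 - p))[:, None]).T @ X        # Hessian (negative definite)
    return S, H

def newton_raphson_mle(X, y, n_iter=25, tol=1e-10):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        S, H = score_and_hessian(theta, X, y)
        step = np.linalg.solve(H, S)               # H^{-1} S
        theta = theta - step                       # the updating rule from above
        if np.max(np.abs(step)) < tol:             # once we "hit" theta_hat, we stay there
            break
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
true_theta = np.array([-0.5, 1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_theta)))
print(newton_raphson_mle(X, y))                    # close to true_theta
```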
DISTRIBUTION OF $\hat \theta$
To obtain the asymptotic distribution of the MLE we apply the mean value theorem (element-wise) to the gradient of the log-likelihood; assuming the log-likelihood is twice continuously differentiable, this gives
$$\frac{\partial}{\partial \theta}\ell\left(\hat \theta\right) = \frac{\partial\ell\left(\theta_0\right)}{\partial\theta}
+\frac{\partial^{2}\ell\left(\bar \theta\right)}{\partial\theta\partial\theta^{\prime}}\left(\hat \theta-\theta_0\right) = \mathbf 0$$
where $\bar \theta$ is a mean value between $\hat \theta$ and $\theta_0$. Then
$$\left(\hat \theta-\theta_0\right) = -\left[\mathbb{H}\left(\bar \theta\right)\right]^{-1}\left[\mathbb{S}\left( \theta_0\right)\right]$$
$$\Rightarrow \sqrt n\left(\hat \theta-\theta_0\right) = -\left[\frac 1n\mathbb{H}\left(\bar \theta\right)\right]^{-1}\left[\frac 1{\sqrt n}\mathbb{S}\left( \theta_0\right)\right]$$
Under the appropriate assumptions, the MLE is a consistent estimator. Then so is $\bar \theta$, since it is sandwiched between the MLE and the true value. Under the assumption that our data is stationary, and one more technical condition (a local dominance condition that guarantees that the expected value of the supremum of the Hessian in a neighborhood of the true value is finite) we have
$$\frac 1n\mathbb{H}\left(\bar \theta\right) \rightarrow_p E\left[\mathbb{H}\left(\theta_0\right)\right]$$
Moreover, if interchange of integration and differentiation is valid (which usually will be), then
$$E\left[\mathbb{S}\left( \theta_0\right)\right]=\mathbf 0$$
This, together with the assumption that our data is i.i.d., permits us to use the Lindeberg–Lévy CLT and conclude that
$$\left[\frac 1{\sqrt n}\mathbb{S}\left( \theta_0\right)\right] \rightarrow_d N(\mathbf 0, \Sigma),\qquad \Sigma = E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]$$
and then, by applying Slutsky's theorem, that
$$\Rightarrow \sqrt n\left(\hat \theta-\theta_0\right) \rightarrow_d N\left(\mathbf 0, \operatorname{Avar}\right)$$
with
$$\operatorname{Avar} = \Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big)^{-1}\cdot \Big(E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]\Big)\cdot \Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big)^{-1}$$
But the information matrix equality states that
$$-\Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big) = \Big(E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]\Big)$$
and so
$$\operatorname{Avar} = -\Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big)^{-1} = \Big(E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]\Big)^{-1}$$
Then for large samples the distribution of $\hat \theta$ is approximated by
$$\hat \theta \sim _{approx} N\left(\theta_0, \frac 1n\operatorname {\widehat Avar}\right)$$
where $\operatorname {\widehat Avar}$ is any consistent estimator of $\operatorname{Avar}$ (the sample analogues of the expected values involved are such consistent estimators).
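As a rough illustration of this last step, here is a sketch that computes the sample analogues of $E\left[\mathbb{H}\left(\theta_0\right)\right]$ and $E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]$ for a hypothetical i.i.d. exponential sample with rate $\lambda$ (my own example), and compares the sandwich and information-equality estimates of the asymptotic variance:

```python
# Sample-analogue estimates of Avar for the MLE of an exponential rate lambda.
import numpy as np

rng = np.random.default_rng(1)
lam0 = 2.0
x = rng.exponential(scale=1 / lam0, size=5000)

lam_hat = 1.0 / x.mean()                    # MLE of the exponential rate

# Per-observation score and Hessian of log f(x; lam) = log(lam) - lam * x,
# evaluated at the MLE (sample analogues of the expectations above).
s_i = 1.0 / lam_hat - x                     # scores
h_i = -1.0 / lam_hat**2 * np.ones_like(x)   # Hessians

H_bar = h_i.mean()                          # (1/n) sum H_i   -> E[H(theta_0)]
Sigma_hat = np.mean(s_i**2)                 # (1/n) sum s_i^2 -> E[S S']

avar_sandwich = Sigma_hat / H_bar**2        # H^{-1} Sigma H^{-1} (scalar case)
avar_info = -1.0 / H_bar                    # -(E[H])^{-1}, via the information equality

n = x.size
print("lam_hat:", lam_hat)
print("standard errors:", np.sqrt(avar_sandwich / n), np.sqrt(avar_info / n))
```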
Your question is easy to answer if you are not too serious about $\theta_i\in[0,1]$. Is $\theta_i\in(0,1)$ good enough? Let's say it is. Then, instead of maximizing the likelihood function $L(\theta)$ in $\theta$, you are going to do a change of variables, and instead you maximize the likelihood function $L(\alpha)=L(\theta(\alpha))$ in $\alpha$.
What's $\theta(\alpha)$, you ask? Well, if $\theta$ is a $K$ dimensional vector, then we let $\alpha$ be a $(K-1)$ dimensional vector and set:
\begin{align}
\theta_1 &= \frac{\exp(\alpha_1)}{1+\sum_{k=1}^{K-1} \exp(\alpha_k)} \\
\theta_2 &= \frac{\exp(\alpha_2)}{1+\sum_{k=1}^{K-1} \exp(\alpha_k)} \\
&\vdots\\
\theta_{K-1} &= \frac{\exp(\alpha_{K-1})}{1+\sum_{k=1}^{K-1} \exp(\alpha_k)} \\
\theta_K &= \frac{1}{1+\sum_{k=1}^{K-1} \exp(\alpha_k)} \\
\end{align}
After you substitute $\alpha$ into your likelihood function, you can maximize it unconstrained. The components of $\alpha$ can be any real numbers, and the $\theta(\alpha)$ function magically imposes all your constraints on $\theta$. So, now the usual theorems proving consistency and asymptotic normality of the MLE follow.
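For concreteness, here is a small sketch of this reparameterization for a hypothetical multinomial likelihood with observed category counts; the counts and the use of `scipy.optimize.minimize` are my own assumptions, not part of the question:

```python
# Unconstrained maximization in alpha via the theta(alpha) map above.
import numpy as np
from scipy.optimize import minimize

def theta_of_alpha(alpha):
    """Map an unconstrained (K-1)-vector alpha to theta in the open simplex."""
    e = np.exp(alpha)
    denom = 1.0 + e.sum()
    return np.append(e, 1.0) / denom          # theta_1, ..., theta_{K-1}, theta_K

def neg_loglik(alpha, counts):
    return -np.sum(counts * np.log(theta_of_alpha(alpha)))

counts = np.array([30.0, 50.0, 20.0])          # K = 3 hypothetical category counts
res = minimize(neg_loglik, x0=np.zeros(len(counts) - 1), args=(counts,))
theta_hat = theta_of_alpha(res.x)
print(theta_hat)                               # close to counts / counts.sum()
```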
What about $\theta$, though? Well, after you have estimated the $\alpha$, you just substitute them into the formulas above to get your estimator for $\theta$. What is the distribution of $\hat{\theta}$? By the delta method, it is asymptotically normal with mean $\theta_0$, the true value of $\theta$, and variance $V(\hat{\theta})=\frac{\partial \theta}{\partial \alpha}' \ V(\hat{\alpha}) \frac{\partial \theta}{\partial \alpha}$.
As you say, $V(\hat{\theta})$ won't be full rank. Obviously, it can't be full rank. Why not? Because the variance of $\sum_i \hat{\theta}_i$ has to be zero: this sum is always 1, so its variance must be zero. A non-invertible variance matrix is not a problem, however, unless you are using it for some purpose it can't be used for (say, to test the null hypothesis that $\sum_i \theta_i = 1$). If you are trying to do that, then the error message telling you that you can't divide by zero is an excellent warning that you are doing something silly.
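To illustrate both the delta-method variance and the rank deficiency, here is a sketch for the same hypothetical multinomial example; the closed-form observed information in $\alpha$ used below follows from the multinomial-logit form of the log-likelihood, and is my own derivation rather than something stated in the question:

```python
# Delta method: V(theta_hat) = J V(alpha_hat) J', with J the Jacobian of theta(alpha).
import numpy as np

counts = np.array([30.0, 50.0, 20.0])
N, K = counts.sum(), counts.size
theta_hat = counts / N                                 # MLE on the simplex

# Observed information of the log-likelihood in alpha and its inverse as V(alpha_hat).
t = theta_hat[:-1]                                     # first K-1 components
info_alpha = N * (np.diag(t) - np.outer(t, t))
V_alpha = np.linalg.inv(info_alpha)

# Jacobian J[i, j] = d theta_i / d alpha_j of the softmax map, evaluated at the MLE.
J = np.zeros((K, K - 1))
for i in range(K):
    for j in range(K - 1):
        J[i, j] = theta_hat[i] * ((i == j) - theta_hat[j])

V_theta = J @ V_alpha @ J.T
print(np.round(V_theta, 6))
print("variance of sum(theta_hat):", np.ones(K) @ V_theta @ np.ones(K))   # ~ 0
print("rank:", np.linalg.matrix_rank(V_theta, tol=1e-12))                 # K - 1, not full rank
```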
What if you are serious about including the endpoints of your interval? Well, that's much harder. What I would suggest is that you think about whether you are really serious. For example, if the $\theta_i$ are probabilities (and that's what your constraints make me think they are), then you really should not be expecting the usual maximum likelihood procedures to give you correct standard errors.
For example, if $\theta_1$ is the probability of heads and $\theta_2$ is the probability of tails, and your dataset looks like ten heads in a row, then the maximum likelihood estimate is $\hat{\theta}_1=1$ and $\hat{\theta}_2=0$. What's the variance of the maximum likelihood estimator evaluated at this estimate? Zero.
If you want to test the null hypothesis that $\theta_1=0.5$, what do you do? You sure don't do this: "Reject the null if $\left|\frac{\hat{\theta}_1-0.5}{\sqrt{\hat{V}(\hat{\theta}_1)}}\right|>1.96$." Instead, you calculate the probability of getting ten heads in a row with a fair coin. If that probability is lower than whatever significance level you picked, then you reject.
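The arithmetic is short enough to spell out (the 0.05 significance level is just an example):

```python
# Exact probability of ten heads in a row under a fair coin.
p_ten_heads = 0.5 ** 10                  # = 1/1024, about 0.000977
alpha_level = 0.05                       # example significance level
print(p_ten_heads)
print("reject" if p_ten_heads < alpha_level else "do not reject")
```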
Best Answer
Whether negative weights can be used depends on the distribution of $W(\boldsymbol{\theta}, \boldsymbol{\alpha})$. For example, consider a linear regression model with independent Gaussian noise, so that for the $i$-th observation we have $$ \theta^0_i = \alpha \theta_i^1 + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2). $$
We get a log-likelihood of the form $$ l(\boldsymbol{\theta}, \boldsymbol{\alpha}) = -\frac{1}{2 \sigma^2} \sum_{i = 1}^n (\theta^0_i - \alpha \theta_i^1)^2 + c, $$ where $c$ does not depend on $\boldsymbol{\alpha}$. This function is quadratic in $\boldsymbol{\alpha}$, so an exact analytic solution for the maximum likelihood estimate is available.
Suppose that we weight our likelihood with weights $w_i$, $i = 1, \dots, n$. If some weight is negative, it can happen that the supremum of the weighted likelihood is $+\infty$ (this occurs when $\sum_{i = 1}^n w_i (\theta_i^1)^2 < 0$), which doesn't make any sense.
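Here is a tiny numerical illustration of this point, with made-up numbers chosen so that $\sum_i w_i (\theta_i^1)^2 < 0$:

```python
# With a negative weight, the weighted Gaussian log-likelihood can grow without
# bound in alpha, so its "maximum" is +infinity (made-up illustrative data).
import numpy as np

theta1 = np.array([1.0, 0.5])          # regressors theta_i^1
theta0 = np.array([2.0, 1.0])          # responses  theta_i^0
w = np.array([1.0, -5.0])              # one negative weight: sum(w * theta1**2) = -0.25 < 0
sigma2 = 1.0

def weighted_loglik(alpha):
    return -0.5 / sigma2 * np.sum(w * (theta0 - alpha * theta1) ** 2)

for alpha in [1.0, 10.0, 100.0, 1000.0]:
    print(alpha, weighted_loglik(alpha))   # keeps increasing as alpha grows
```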
So, whether negative weights are admissible depends on the statistical model used, and it requires additional attention to ensure that the resulting solution makes physical sense.