1.
$p(x_i) = {x_i-1 \choose k-1}\pi^k(1-\pi)^{x_i-k}$
$L(\pi;x_1,\dots,x_n) = \prod_{i=1}^{n}{x_i-1 \choose k-1}\pi^k(1-\pi)^{x_i-k}$
$\ell(\pi;x_1,\dots,x_n) = \sum_{i=1}^{n}\left[\log{x_i-1 \choose k-1}+k\log(\pi)+(x_i-k)\log(1-\pi)\right]$
$\dfrac{d\ell(\pi;x_1,\dots,x_n)}{d\pi} = \sum_{i=1}^{n}\left[\dfrac{k}{\pi}-\dfrac{x_i-k}{1-\pi}\right]$
Set this to zero,
$\frac{nk}{\pi}=\frac{\sum_{i=1}^nx_i-nk}{1-\pi}$
$\therefore$ $\hat\pi=\frac{nk}{\sum_{i=1}^nx_i}$
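Here is a quick numerical sanity check of the closed-form result (not part of the original derivation): it maximizes the log-likelihood directly and compares the answer to $nk/\sum_i x_i$. The parameter values are arbitrary, $k$ is taken as known, and `numpy`/`scipy` are assumed to be available.

```python
# Illustrative sanity check (not from the original derivation); parameter values
# are arbitrary. x_i is the total number of trials needed to see the k-th success.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import nbinom

rng = np.random.default_rng(0)
k, pi_true, n = 5, 0.3, 2000

# scipy's nbinom counts failures before the k-th success, so add k to get trials.
x = nbinom.rvs(k, pi_true, size=n, random_state=rng) + k

def neg_log_lik(pi):
    # Terms not involving pi are dropped; they do not affect the maximizer.
    return -(n * k * np.log(pi) + np.sum(x - k) * np.log(1 - pi))

numeric = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded").x
closed_form = n * k / np.sum(x)
print(numeric, closed_form)  # the two estimates should agree closely
```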
2.
For the second part you need the theorem that $\sqrt{n}(\hat\theta-\theta) \overset{D}{\rightarrow}N(0,\frac{1}{I(\theta)})$, where $I(\theta)$ is the Fisher information. Therefore, the standard deviation of $\hat\theta$ will be $[nI(\theta)]^{-1/2}$; you can also call it the standard error, since the CLT is used here.
So we need to calculate the Fisher information for the negative binomial distribution.
$\frac{\partial^2 \log(P(x;\pi))}{\partial\pi^2}=-\frac{k}{\pi^2}-\frac{x-k}{(1-\pi)^2}$
$I(\pi)=-E\left(-\frac{k}{\pi^2}-\frac{x-k}{(1-\pi)^2}\right)=\frac{k}{\pi^2}+\frac{k(1-\pi)}{(1-\pi)^2\pi}=\frac{k}{\pi^2(1-\pi)}$
Note: $E(x) =\frac{k}{\pi}$ for the negative binomial pmf
Therefore, the standard error for $\hat \pi$ is $[n(\frac{k}{\pi^2}+\frac{k(1-\pi)}{(1-\pi)^2\pi})]^{-1/2}$
Simplifying, we get $se(\hat\pi)=\sqrt{\dfrac{\pi^2(1-\pi)}{nk}}$
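As a rough check of this formula (illustrative values only), one can simulate many datasets, compute $\hat\pi = nk/\sum_i x_i$ for each, and compare the empirical spread of the estimates with $\sqrt{\pi^2(1-\pi)/(nk)}$:

```python
# Monte Carlo check of the asymptotic standard error (illustrative values only).
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(1)
k, pi, n, reps = 5, 0.3, 500, 4000

se_formula = np.sqrt(pi**2 * (1 - pi) / (n * k))

# Simulate many samples of size n, compute the MLE nk / sum(x) for each,
# and compare the empirical spread of the estimates to the formula above.
x = nbinom.rvs(k, pi, size=(reps, n), random_state=rng) + k
pi_hat = n * k / x.sum(axis=1)
print(se_formula, pi_hat.std(ddof=1))  # these should be close
```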
3.
The geometric distribution is the special case of the negative binomial distribution with $k = 1$; note that $\pi(1-\pi)^{x-1}$ is a geometric pmf.
Therefore, a negative binomial random variable can be written as a sum of $k$ independent, identically distributed geometric random variables.
So, by the CLT, the negative binomial distribution will be approximately normal if the parameter $k$ is large enough.
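A small simulation (a sketch, not part of the original argument) makes this concrete: summing $k$ i.i.d. geometric draws reproduces the negative binomial, and for large $k$ its tail probabilities roughly match the normal approximation.

```python
# Sketch (not part of the original argument): a negative binomial draw built as
# a sum of k i.i.d. geometric draws, compared to its normal approximation.
import numpy as np
from scipy.stats import geom, norm

rng = np.random.default_rng(2)
k, pi, reps = 200, 0.3, 20_000

# Sum of k geometric(pi) trial counts is negative binomial on the "trials" scale.
x = geom.rvs(pi, size=(reps, k), random_state=rng).sum(axis=1)

mean, sd = k / pi, np.sqrt(k * (1 - pi)) / pi  # negative binomial mean and sd
# Empirical upper-tail probability vs. the normal approximation; these should
# roughly agree when k is large.
t = mean + 2 * sd
print((x > t).mean(), 1 - norm.cdf(2))
```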
I assume that to get the expected Hessian matrix I need to run my maximum likelihood program for multiple iterations and collect multiple Hessian matrices.
No, the expectation is based on the model. We're not getting some kind of ensemble average; we're literally finding an expectation:
$\mathcal{I}(\theta) = - \text{E} \left[\left. \frac{\partial^2}{\partial\theta^2} \log f(X;\theta)\right|\theta \right]\,.$
(though we might be finding it from a different expression that yields the same quantity).
That is, we do some algebra before we implement it in computation.
We have a single ML estimate, and we're computing the standard error from the second derivative of the likelihood at the peak -- a "sharp" peak means a small standard error, while a broad peak means a large standard error.
You might like to verify that when you do this for a normal likelihood (iid observations from $N(\mu,\sigma)$, with $\sigma$ known), the calculation yields a Fisher information of $n/\sigma^2$, and hence an asymptotic variance of $\sigma^2/n$ for the ML estimate of $\mu$, or a standard error of $\sigma/\sqrt{n}$. (Of course in this case, that's also the small-sample variance and standard error.)
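If it helps, here is a minimal numerical illustration of that normal example (the data and constants are made up): the curvature of the log-likelihood at its peak matches $n/\sigma^2$, and the implied standard error matches $\sigma/\sqrt{n}$.

```python
# Made-up numerical illustration of the normal example above: the curvature of
# the log-likelihood at its peak equals n / sigma^2 (sigma known), so the
# standard error of the ML estimate of mu is sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(3)
n, mu, sigma = 200, 1.5, 2.0
x = rng.normal(mu, sigma, size=n)

def log_lik(m):
    return -0.5 * np.sum((x - m) ** 2) / sigma**2  # additive constants dropped

mu_hat, h = x.mean(), 1e-4
# Central-difference estimate of the second derivative at the peak.
curvature = (log_lik(mu_hat + h) - 2 * log_lik(mu_hat) + log_lik(mu_hat - h)) / h**2
print(-curvature, n / sigma**2)                  # observed information vs. n / sigma^2
print((-curvature) ** -0.5, sigma / np.sqrt(n))  # matching standard errors
```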
Best Answer
Your question is easy to answer if you are not too serious about $\theta_i\in[0,1]$. Is $\theta_i\in(0,1)$ good enough? Let's say it is. Then, instead of maximizing the likelihood function $L(\theta)$ in $\theta$, you are going to do a change of variables, and instead you maximize the likelihood function $L(\alpha)=L(\theta(\alpha))$ in $\alpha$.
What's $\theta(\alpha)$, you ask? Well, if $\theta$ is a $K$ dimensional vector, then we let $\alpha$ be a $(K-1)$ dimensional vector and set:
\begin{align} \theta_1 &= \frac{\exp(\alpha_1)}{1+\sum_{k=1}^{K-1} \exp(\alpha_k)} \\ \theta_2 &= \frac{\exp(\alpha_2)}{1+\sum_{k=1}^{K-1} \exp(\alpha_k)} \\ &\vdots\\ \theta_{K-1} &= \frac{\exp(\alpha_{K-1})}{1+\sum_{k=1}^{K-1} \exp(\alpha_k)} \\ \theta_K &= \frac{1}{1+\sum_{k=1}^{K-1} \exp(\alpha_k)} \\ \end{align}
After you substitute $\alpha$ into your likelihood function, you can maximize it unconstrained. The $\alpha_k$ can be any real numbers. The $\theta(\alpha)$ function magically imposes all your constraints on $\theta$. So, now the usual theorems proving consistency and asymptotic normality of the MLE follow.
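Here is a minimal sketch of the idea, using a hypothetical multinomial-counts example (the data and function names are just for illustration): $\alpha$ is optimized without constraints, and $\theta(\alpha)$ lands in the simplex automatically.

```python
# Minimal sketch of the reparameterization with a hypothetical multinomial-counts
# example (data and names are purely illustrative). alpha lives in R^(K-1) with
# no constraints; theta(alpha) always lands in the open simplex.
import numpy as np
from scipy.optimize import minimize

counts = np.array([30, 50, 20])  # made-up data for K = 3 categories
K = counts.size

def theta_of_alpha(alpha):
    e = np.exp(alpha)
    denom = 1.0 + e.sum()
    return np.append(e, 1.0) / denom  # last component is 1 / (1 + sum exp(alpha_k))

def neg_log_lik(alpha):
    return -np.sum(counts * np.log(theta_of_alpha(alpha)))

res = minimize(neg_log_lik, x0=np.zeros(K - 1))  # plain unconstrained optimization
theta_hat = theta_of_alpha(res.x)
print(theta_hat, counts / counts.sum())  # matches the closed-form multinomial MLE
```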
What about $\theta$, though? Well, after you have estimated the $\alpha$, you just substitute them into the formulas above to get your estimator for $\theta$. What is the distribution of $\theta$? It is asymptotically normal with mean $\theta_0$, the true value of $\theta$, and variance $V(\hat{\theta})=\frac{\partial \theta}{\partial \alpha}' \ V(\hat{\alpha}) \frac{\partial \theta}{\partial \alpha}$.
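Continuing the hypothetical multinomial sketch above (it reuses `theta_of_alpha`, `neg_log_lik`, and `res` from that block), the delta-method variance can be computed from the Jacobian $\partial\theta/\partial\alpha$ and an estimate of $V(\hat{\alpha})$, here taken as the inverse of a finite-difference Hessian of the negative log-likelihood:

```python
# Delta-method variance for theta_hat, continuing the hypothetical multinomial
# sketch above (reuses theta_of_alpha, neg_log_lik, and res from that block).
import numpy as np

def jacobian_theta(alpha):
    # d theta_j / d alpha_m = theta_j * (1{j==m} - theta_m), a K x (K-1) matrix.
    theta = theta_of_alpha(alpha)
    J = -np.outer(theta, theta[:-1])
    idx = np.arange(theta.size - 1)
    J[idx, idx] += theta[:-1]
    return J

def num_hessian(f, a, h=1e-5):
    # Simple central-difference Hessian of f at the point a.
    a = np.asarray(a, dtype=float)
    m = a.size
    H = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            ei, ej = np.eye(m)[i] * h, np.eye(m)[j] * h
            H[i, j] = (f(a + ei + ej) - f(a + ei - ej)
                       - f(a - ei + ej) + f(a - ei - ej)) / (4 * h**2)
    return H

V_alpha = np.linalg.inv(num_hessian(neg_log_lik, res.x))
J = jacobian_theta(res.x)
V_theta = J @ V_alpha @ J.T
print(V_theta.sum())                # ~0: the variance of sum(theta_hat) is zero
print(np.linalg.eigvalsh(V_theta))  # smallest eigenvalue ~0, so not full rank
```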
As you say, $V(\hat{\theta})$ won't be full rank. Obviously, it can't be full rank. Why not? Because we know the variance of $\sum \hat{\theta}_i$ has to be zero---this sum is always 1, so its variance must be zero. A non-invertible variance matrix is not a problem, however, unless you are using it for some purpose it can't be used for (say to test the null hypothesis that $\sum \theta_i = 1$). If you are trying to do that, then the error message telling you that you can't divide by zero is an excellent warning that you are doing something silly.
What if you are serious about including the endpoints of your interval? Well, that's much harder. What I would suggest is that you think about whether you are really serious. For example, if the $\theta_i$ are probabilities (and that's what your constraints make me think they are), then you really should not be expecting the usual maximum likelihood procedures to give you correct standard errors.
For example, if $\theta_1$ is the probability of heads and $\theta_2$ is the probability of tails, and your dataset looks like ten heads in a row, then the maximum likelihood estimate is $\hat{\theta}_1=1$ and $\hat{\theta}_2=0$. What's the variance of the maximum likelihood estimator evaluated at this estimate? Zero.
If you want to test the null hypothesis that $\theta_1=0.5$, what do you do? You sure don't do this: "Reject null if $\left|\frac{\hat{\theta}_1-0.5}{\sqrt{\hat{V}(\hat{\theta}_1)}}\right|>1.96$" Instead, you calculate the probability that you get ten heads in a row with a fair coin. If that probability is lower than whatever significance level you picked, then you reject.
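For the coin example, that exact calculation is a one-liner; the sketch below uses `scipy.stats.binomtest` purely as an illustration.

```python
# Exact calculation for the coin example: the probability of ten heads in ten
# tosses of a fair coin, rather than a Wald statistic with zero estimated variance.
from scipy.stats import binomtest

result = binomtest(k=10, n=10, p=0.5, alternative="greater")
print(result.pvalue)  # 0.5**10 ~= 0.00098, well below the usual 5% level
```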