Finding the MLE for a Categorical Distribution with K outcomes and N data points

probability-distributions, statistics

Consider a categorical distribution, which is a discrete distribution over $K$ outcomes (i.e. $1$ through $K$). The probability of each category is explicitly represented by a parameter $\theta_k$, where $\theta_k \geq 0$ and $\sum_{k=1}^{K} \theta_k = 1$.
Also, each observation $\textbf{x}$ uses a $1$-of-$K$ encoding (i.e. $\textbf{x}$ is a vector in which one entry is $1$ and the rest are $0$). Under this model, the probability of an observation can be written in the following form: $p(\textbf{x}|\theta)=\prod_{k=1}^{K}\theta_k^{x_k}$.
Assume that the dataset is $\mathcal{D}=\{ \textbf{x}^{(1)},…,\textbf{x}^{(N)} \}$.
We denote the count for each outcome $k$ by $N_k = \sum_{i=1}^N x_k^{(i)}$, and $N = \sum_{k=1}^K N_k$.
Note that each data point is in the $1$-of-$K$ encoding
(i.e. $x_k^{(i)} = 1$ if the $i$-th data point represents outcome $k$, and $x_k^{(i)} = 0$ otherwise).
Assuming that $N_k > 0$ for every $k$, find the maximum likelihood estimator (MLE) $\hat{\theta}_k$.
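
To make the setup concrete, here is a minimal sketch (assuming NumPy; the outcome probabilities are made up for illustration) that draws a dataset in the $1$-of-$K$ encoding and computes the counts $N_k$:

```python
import numpy as np

rng = np.random.default_rng(0)

K, N = 4, 1000
true_theta = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical probabilities, for illustration only

# Draw N outcomes and store them in 1-of-K (one-hot) encoding.
outcomes = rng.choice(K, size=N, p=true_theta)
X = np.eye(K)[outcomes]            # shape (N, K): each row has a single 1

# Counts N_k for each outcome; they sum to N.
N_k = X.sum(axis=0)
print(N_k, int(N_k.sum()))
```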

$\textbf{My Attempt:}$
Since the observations in $\mathcal{D}$ are independent and identically distributed,
the likelihood function is $L(\theta) = p(\textbf{x}^{(1)},…,\textbf{x}^{(N)}~|~\theta_1,…,\theta_K) = \prod_{i=1}^{N} p(\textbf{x}^{(i)}~|~\theta)$.
So, the log-likelihood is $l(\theta) = \log\big(p(\textbf{x}^{(1)},…,\textbf{x}^{(N)}~|~\theta_1,…,\theta_K)\big) = \sum_{i=1}^{N}\sum_{k=1}^{K} x_k^{(i)}\log(\theta_k)$.
Since $\sum_{k=1}^{K} \theta_k = 1$, we have $\theta_K = 1 - \sum_{k=1}^{K-1} \theta_k$.
Then, $$l(\theta) = \sum_{i=1}^{N}\bigg[x_1^{(i)}\log(\theta_1) + \cdots + x_{K-1}^{(i)}\log(\theta_{K-1}) + x_{K}^{(i)}\log\bigg(1-\sum_{k=1}^{K-1}\theta_k\bigg)\bigg]$$
Since $N_k = \sum_{i=1}^N x_k^{(i)}$ and $N = \sum_{k=1}^K N_k$, we get
\begin{align*}
l(\theta) &= \sum_{i=1}^{N}\bigg[x_1^{(i)}\log(\theta_1) + \cdots + x_{K-1}^{(i)}\log(\theta_{K-1})\bigg] + \sum_{i=1}^{N}\bigg[x_{K}^{(i)}\log\bigg(1-\sum_{k=1}^{K-1}\theta_k\bigg)\bigg] \\
&= \bigg[N_1\log(\theta_1) + \cdots + N_{K-1}\log(\theta_{K-1})\bigg] + N_K\log\bigg(1-\sum_{k=1}^{K-1}\theta_k\bigg)
\end{align*}
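
As a quick numerical sanity check (a sketch assuming NumPy, with data simulated as in the snippet above), the per-observation sum and its collapsed form in terms of the counts $N_k$ agree for any $\theta$ in the interior of the simplex:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 4, 1000
X = np.eye(K)[rng.choice(K, size=N, p=[0.1, 0.2, 0.3, 0.4])]   # simulated 1-of-K data
N_k = X.sum(axis=0)

theta = np.full(K, 1.0 / K)        # any interior point of the simplex

# Log-likelihood as a sum over individual data points ...
ll_per_point = (X * np.log(theta)).sum()
# ... and collapsed onto the counts N_k, as in the derivation above.
ll_counts = (N_k * np.log(theta)).sum()

print(np.isclose(ll_per_point, ll_counts))   # True
```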

Then, by the method of Lagrange multipliers,
we can set $f(\theta_1,…,\theta_{K-1})=N_1\log(\theta_1) + \cdots + N_{K-1}\log(\theta_{K-1})$ and $g(\theta_1,…,\theta_{K-1}) = N_K\log\bigg(1-\sum_{k=1}^{K-1}\theta_k\bigg)$.
Also, with $\lambda = 1$ we have $l(\theta)=f(\theta_1,…,\theta_{K-1})+\lambda g(\theta_1,…,\theta_{K-1})$.
So, $\nabla l(\theta) = \big(\frac{\partial l}{\partial \theta_1},…,\frac{\partial l}{\partial \theta_{K-1}}, \frac{\partial l}{\partial \lambda}\big)$.
Then we have $\nabla l(\theta) = \bigg(\frac{N_1}{\theta_1} - \frac{N_K}{1-\sum_{k=1}^{K-1}\theta_k},…, \frac{N_{K-1}}{\theta_{K-1}} - \frac{N_K}{1-\sum_{k=1}^{K-1}\theta_k}, N_K\log\big(1-\sum_{k=1}^{K-1}\theta_k\big)\bigg)$.
Since $\theta_K = 1 - \sum_{k=1}^{K-1} \theta_k$, this becomes
$\nabla l(\theta) = \bigg(\frac{N_1}{\theta_1} - \frac{N_K}{\theta_K},…, \frac{N_{K-1}}{\theta_{K-1}} - \frac{N_K}{\theta_K}, N_K\log\big(1-\sum_{k=1}^{K-1}\theta_k\big)\bigg)$.
Then, setting every component to $0$, we have
\begin{align*}
\frac{N_1}{\theta_1} - \frac{N_K}{\theta_K} = 0 &\implies N_1\theta_K - N_K\theta_1 = 0 \\
&~~~~~~~\vdots \\
\frac{N_{K-1}}{\theta_{K-1}} - \frac{N_K}{\theta_K} = 0 &\implies N_{K-1}\theta_K - N_K\theta_{K-1} = 0
\end{align*}

Since we assumed that $N_k > 0$, the last component gives $N_K\log\big(1-\sum_{k=1}^{K-1}\theta_k\big) = 0 \implies \log\big(1-\sum_{k=1}^{K-1}\theta_k\big) = 0 \implies 1-\sum_{k=1}^{K-1}\theta_k = 1 \implies \theta_K = 1$.
So, substituting $\theta_K = 1$ into the equations above, we get the maximum likelihood estimators
$\hat{\theta}_1=\frac{N_1}{N_K}$, $\cdots$, $\hat{\theta}_{K-1}=\frac{N_{K-1}}{N_K}$ and $\hat{\theta}_K = 1$.

But I don't feel this is correct, since I think the correct answer should be $\hat{\theta}_k=\frac{N_k}{N_1+\cdots+N_K}$ for each $k \in \{1,…,K\}$.
So, where did I go wrong, and what should I fix to make it right?

Best Answer

There is no need to use Lagrange multipliers; you've already encoded the constraint explicitly by getting rid of $\theta_K$ at the beginning.

Starting with the $l(\theta)$ expression before you mention Lagrange multipliers, setting the partial derivatives with respect to $\theta_1, \ldots, \theta_{K-1}$ to zero leads to the $K-1$ equations $\frac{N_i}{\theta_i} = \frac{N_K}{1-\sum_{k=1}^{K-1} \theta_k}$ for $i=1,\ldots,K-1$, which can be rearranged to $$\theta_i = \frac{N_i}{N_K} \left(1 - \sum_{k=1}^{K-1} \theta_k\right).$$ Summing over $i$ gives $$\sum_{i=1}^{K-1} \theta_i = \frac{N_1 + \cdots + N_{K-1}}{N_K}\left(1 - \sum_{i=1}^{K-1} \theta_i\right),$$ which implies $\sum_{i=1}^{K-1} \theta_i = \frac{N_1 + \cdots + N_{K-1}}{N_1 + \cdots + N_K}$, and thus $$\theta_i = \frac{N_i}{N_K}\left(1-\frac{N_1 + \cdots + N_{K-1}}{N_1 + \cdots + N_K}\right) = \frac{N_i}{N_1 + \cdots + N_K}.$$
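
For what it's worth, here is a small numerical check (a sketch assuming NumPy, with simulated data) that the closed form $\hat{\theta}_k = N_k / N$ attains a log-likelihood at least as large as that of any other point on the simplex:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 4, 1000
X = np.eye(K)[rng.choice(K, size=N, p=[0.1, 0.2, 0.3, 0.4])]   # simulated 1-of-K data
N_k = X.sum(axis=0)

def log_lik(theta):
    # l(theta) = sum_k N_k * log(theta_k), as derived in the question.
    return (N_k * np.log(theta)).sum()

theta_mle = N_k / N_k.sum()        # closed-form answer: N_k / N

# Compare against many random points on the simplex; the MLE should dominate all of them.
others = rng.dirichlet(np.ones(K), size=10_000)
print(np.all(log_lik(theta_mle) >= np.array([log_lik(t) for t in others])))  # True
```

This is consistent with the fact that $\sum_k N_k \log\theta_k$ is concave on the simplex, so the stationary point found above is the global maximum.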