Logistic – Exploring Maximum Likelihood and Minimum Distance Estimators in Logit Model

Tags: consistency, distance, logistic, maximum likelihood

Suppose I have data $\{y_i,x_i\}_{i=1}^N$, where $x_i\in\{s_1,\dots,s_K\}$ follows a discrete uniform distribution. For each realized $x_i$, $y_i$ is generated by the logit model, i.e., $\Pr(Y_i=1\mid x_i)=\frac{\exp(\beta_0+x_i\beta_1)}{1+\exp(\beta_0+x_i\beta_1)}$. We want to estimate $(\beta_0,\beta_1)$ from the data $\{y_i,x_i\}_{i=1}^N$. There are two natural ways of doing it.
One method is the maximum likelihood approach:

$(\widehat{\beta}_0,\widehat{\beta}_1)=\underset{\beta_0,\beta_1}{\operatorname{argmax}} \sum_{i=1}^N \left\{y_i\log\left[\frac{\exp(\beta_0+x_i\beta_1)}{1+\exp(\beta_0+x_i\beta_1)}\right]+(1-y_i)\log\left[\frac{1}{1+\exp(\beta_0+x_i\beta_1)}\right]\right\}$
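For concreteness, here is a minimal Python sketch of this approach on simulated data (the support points, sample size, and true parameter values below are made up for illustration, and a generic quasi-Newton optimizer stands in for a dedicated logit routine):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
beta0_true, beta1_true = -0.5, 1.0                 # made-up true parameters
support = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])    # made-up s_1, ..., s_K
N = 5000

x = rng.choice(support, size=N)                    # discrete uniform x_i
p = 1.0 / (1.0 + np.exp(-(beta0_true + beta1_true * x)))
y = rng.binomial(1, p)                             # y_i generated by the logit model

def neg_loglik(beta):
    """Negative of the log-likelihood in the display above."""
    eta = beta[0] + beta[1] * x
    # y*log(Lambda) + (1-y)*log(1-Lambda) simplifies to y*eta - log(1 + e^eta)
    return -np.sum(y * eta - np.log1p(np.exp(eta)))

mle = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print("ML estimate:", mle.x)                       # close to (-0.5, 1.0)
```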

The other way of doing it is minimum distance: I first estimate $\Pr(Y_i=1\mid x_i=s_k)$ by the relative frequency $\frac{\sum_{i=1}^N\mathbf{1}(y_i=1,x_i=s_k)}{\sum_{i=1}^N\mathbf{1}(x_i=s_k)}$ for each $s_k\in \{s_1,\dots,s_K\}$ and denote this estimator by $\widehat{p}_k$. I then choose the parameters to minimize the discrepancy between the model-implied probabilities and the relative frequencies:

$(\widetilde{\beta}_0,\widetilde{\beta}_1)=\underset{\beta_0,\beta_1}{\operatorname{argmin}}\left\|\left[\frac{\exp(\beta_0+s_1\beta_1)}{1+\exp(\beta_0+s_1\beta_1)},\dots,\frac{\exp(\beta_0+s_K\beta_1)}{1+\exp(\beta_0+s_K\beta_1)}\right]-\left[\widehat{p}_1,\dots,\widehat{p}_K\right]\right\|^2.$
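And a matching sketch of the minimum-distance approach on the same kind of simulated data (again, all numerical values are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
support = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])    # made-up s_1, ..., s_K
N = 5000
x = rng.choice(support, size=N)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x))))

# Step 1: relative frequencies  p_hat_k = #{y_i = 1, x_i = s_k} / #{x_i = s_k}
p_hat = np.array([y[x == s].mean() for s in support])

# Step 2: minimize the squared distance between the model-implied
#         probabilities and the relative frequencies
def distance(beta):
    model_p = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * support)))
    return np.sum((model_p - p_hat) ** 2)

md = minimize(distance, x0=np.zeros(2), method="BFGS")
print("MD estimate:", md.x)                        # numerically close to the ML estimate
```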

Intuitively, both estimators should be consistent (converging to the true value that generated our data). But on the other hand, they seem to be doing completely different things: one tries to make the likelihood as large as possible, while the other tries to make the model-implied probabilities as close to the relative frequencies as possible. Why is the parameter value that maximizes the likelihood also able to drive the distance to zero in the limit? Intuition or a formal proof are both welcome.

Best Answer

Let's do some math, with some notation to compact things. I will write $\Lambda_i$ to denote the logistic distribution function evaluated at observation $i$, with the sign convention $\Lambda_i \equiv \Lambda(-\beta_0-\beta_1x_i)$, so that $\Pr(Y_i=1\mid x_i)=1-\Lambda_i$. Similarly, I will write $\Lambda(s_j)$ for its value when $x_i$ equals the support point $s_j$ (same convention), $j=1,\dots,K$. I will write $\sum_{i=1}^N\mathbf{1}(x_i=s_j)=n_j$ to denote the number of observations with $x_i = s_j$. Finally note that $$\frac{\partial \Lambda(z)}{\partial z}=\Lambda(z)[1-\Lambda(z)].$$
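(A quick central-difference check of this derivative identity, purely illustrative:)

```python
import numpy as np

def Lam(z):                                        # logistic distribution function
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
h = 1e-6
numeric = (Lam(z + h) - Lam(z - h)) / (2 * h)      # central finite difference
analytic = Lam(z) * (1.0 - Lam(z))
print(np.max(np.abs(numeric - analytic)))          # tiny discrepancy, ~1e-10
```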

MINIMUM DISTANCE ESTIMATION

The objective function here is

$$\sum_{j=1}^K \big[(1-\Lambda(s_j))-\hat p_j\big]^2$$

The f.o.c. with respect to $\beta_1$ (the one with respect to $\beta_0$ is identical, with $s_j$ replaced by $1$) requires

$$-2\sum_{j=1}^K \big[1-\Lambda(s_j)-\hat p_j\big]\cdot \Lambda(s_j)[1-\Lambda(s_j)]s_j=0. \tag{1}$$

MAXIMUM LIKELIHOOD

The objective function here is $$\sum_{i=1}^N \big[y_i\ln(1-\Lambda_i) + (1-y_i)\ln\Lambda_i\big]$$

Taking the derivative with respect to $\beta_1$ and simplifying (the condition for $\beta_0$ is analogous, with $x_i$ replaced by $1$), we arrive at the first order condition $$\sum_{i=1}^N(1-\Lambda_i-y_i)\cdot x_i=0$$ $$\implies \sum_{i=1}^N(1-\Lambda_i)\cdot x_i=\sum_{i=1}^Ny_ix_i. \tag{2}$$
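A small numerical sketch (simulated data with made-up values, since the problem specifies no numbers) showing that $(2)$ holds at the fitted coefficients; note that $1-\Lambda_i$ is the fitted $\Pr(Y_i=1\mid x_i)$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
support = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])    # made-up values
x = rng.choice(support, size=5000)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x))))

def neg_loglik(beta):
    eta = beta[0] + beta[1] * x
    return -np.sum(y * eta - np.log1p(np.exp(eta)))

b = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x   # ML estimate

one_minus_Lam = 1.0 / (1.0 + np.exp(-(b[0] + b[1] * x)))    # fitted P(y_i = 1 | x_i)
print(np.sum((one_minus_Lam - y) * x))   # f.o.c. (2) for beta_1: ~ 0, up to optimizer tolerance
print(np.sum(one_minus_Lam - y))         # f.o.c. for beta_0: ~ 0 as well
```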

We can decompose $(2)$ according to the values of $x$, in which case all $\Lambda_i$ for which $x_i = s_j$ become equal to $\Lambda(s_j)$. Namely, $(2)$ can be written

$$(2):\sum_{x_i=s_1}(1-\Lambda_i)\cdot s_1+ \cdots + \sum_{x_i=s_K}(1-\Lambda_i)\cdot s_K = \sum_{x_i=s_1}y_is_1+\cdots + \sum_{x_i=s_K}y_is_K$$

Each sum on the left-hand side consists of $n_j$ identical terms. On the right-hand side, in each sum some $y_i$ are zero and some equal unity, so for each sum we have $\sum_{x_i=s_j}y_i = n_j\hat p_j$.

Using these remarks we can write $(2)$ as

$$n_1\cdot [1-\Lambda (s_1)]\cdot s_1 + \cdots + n_K[1-\Lambda(s_K)]\cdot s_K = n_1\cdot \hat p_1s_1+\cdots + n_K\cdot \hat p_Ks_K,$$

and compacting,

$$ \sum_{j=1}^K n_js_j\big[1-\Lambda(s_j)\big]=\sum_{j=1}^Kn_j\hat p_js_j. \tag{3}$$

It will make no difference if we divide both sides by the sample size $N$. Writing $\hat q_j \equiv n_j/N$ for the relative frequency of $x=s_j$, the maximum likelihood f.o.c. finally becomes

$$ \sum_{j=1}^K \big[1-\Lambda(s_j)-\hat p_j\big]\hat q_js_j= 0. \tag{4}$$

Comparing $(1)$ and $(4)$, I guess you can take it from here.
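If you want a numerical sanity check, the sketch below (simulated data with made-up values, a generic optimizer for the ML step, as in the earlier snippets) confirms that the ML solution satisfies $(4)$, and that each bracket $1-\Lambda(s_j)-\hat p_j$, which is exactly a minimum-distance residual, is already close to zero in a large sample:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
support = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])    # made-up values
x = rng.choice(support, size=5000)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x))))

def neg_loglik(beta):
    eta = beta[0] + beta[1] * x
    return -np.sum(y * eta - np.log1p(np.exp(eta)))

b = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x       # ML estimate

p_hat = np.array([y[x == s].mean() for s in support])    # relative freq. of y=1 given s_j
q_hat = np.array([(x == s).mean() for s in support])     # relative freq. of x = s_j
model_p = 1.0 / (1.0 + np.exp(-(b[0] + b[1] * support)))  # = 1 - Lambda(s_j) at the MLE

print(np.sum((model_p - p_hat) * q_hat * support))   # f.o.c. (4): ~ 0
print(np.sum((model_p - p_hat) * q_hat))             # its beta_0 analogue: ~ 0
print(np.max(np.abs(model_p - p_hat)))               # each bracket itself is small for large N
```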
