Empirical Risk Minimization – Establishing Connection Between ERM and Maximum Likelihood Estimation

Tags: maximum likelihood, optimization

My question is specifically about section 2.3.5 "Connection to maximum likelihood estimation" from Nielsen (2016), where the connection between Empirical Risk Minimization (ERM) and maximum likelihood estimation (MLE) is established.

Nielsen (2016) describes how the two concepts of a model class (or hypothesis space) and ERM (Vapnik, 1999) enable us to formulate an optimization problem. The model class gives us a restricted set of functions that are candidate optimal solutions, and ERM gives us a way of deciding between these functions. Most model classes are indexed by a parameter $\theta\in\Theta$ (where $\Theta$ denotes the parameter space) that we want to estimate from the given data with $\hat{\theta}$. We can then write the model as:
$$\hat{f}(x)=f(x;\hat{\theta})$$

We are hence assuming a functional form. For example, we can write linear regression as an ERM problem with $L\left(y,f\left(x\right)\right)=\left(f\left(x\right)-y\right)^2$ and $F$ the space of linear functions $f=bx$ (Poggio, 2011).
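
As a concrete illustration (a minimal sketch with made-up data, not taken from any of the references), the empirical risk minimizer over this class can be computed directly:

```python
import numpy as np

# Made-up toy data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def empirical_risk(b):
    """Empirical risk: average squared loss of f(x) = b*x over the sample."""
    return np.mean((b * x - y) ** 2)

# Minimizing over F = {f(x) = b*x : b real} has a closed form:
# set the derivative of the risk to zero, giving b = <x, y> / <x, x>.
b_hat = np.dot(x, y) / np.dot(x, x)
print(b_hat, empirical_risk(b_hat))
```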

Let us now establish the connection between ERM and maximum likelihood estimation (MLE) as described in Nielsen (2016). With i.i.d. data, we can formulate MLE as an ERM problem with an appropriate loss function. Assume $Y$ follows a parametric distribution $Y\sim P_Y\left(y;\theta\right)$, where $\theta\in\Theta$ are the parameters. We can use MLE to estimate $\theta$ (below, $l$ denotes the likelihood function). With MLE we want to maximize the likelihood of the observed data over the parameter space (Myung, 2003).
$$\hat{\theta}=\underset{\theta\in\Theta}{\text{argmax }}l(\theta;y_1,\dots,y_n)=\underset{\theta\in\Theta}{\text{argmax}}\sum_{i=1}^n\log P_Y(y_i;\theta)$$
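
The second equality uses that for i.i.d. data the likelihood factorizes,
$$l(\theta;y_1,\dots,y_n)=\prod_{i=1}^n P_Y(y_i;\theta),$$
and that the logarithm is strictly increasing, so maximizing the log-likelihood yields the same $\hat{\theta}$.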

We can let the parameter $\theta$ depend on $X$ with $\theta:\mathcal{X}\rightarrow\Theta$ and assume that,
$$Y|X\sim P_{Y|X}(y;\theta(X))$$

Then, since maximizing the log-likelihood is equivalent to minimizing the average negative log-likelihood (the factor $1/n$ does not change the minimizer),
$$\hat{\theta}=\underset{\theta\in\Theta^{\mathcal{X}}}{\text{argmin }}\left\{\frac{1}{n}\sum_{i=1}^n-\log P_{Y|X}(y_i;\theta(x_i))\right\}$$

Hence, we can see the equivalence to the empirical risk minimizer of the loss function:
$$L(y,\theta(x))=-\log P_{Y|X}(y;\theta(x))$$
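
To see how this recovers the linear regression example from above (a standard worked case; this step is mine, not quoted from the thesis), assume $Y|X\sim N(\theta(x),\sigma^2)$ with $\sigma^2$ fixed. The loss then becomes
$$L(y,\theta(x))=-\log P_{Y|X}(y;\theta(x))=\frac{(y-\theta(x))^2}{2\sigma^2}+\frac{1}{2}\log(2\pi\sigma^2),$$
which, up to an additive constant and the positive factor $1/(2\sigma^2)$, is the squared loss. With $\theta(x)=bx$, the ERM problem is exactly least-squares linear regression.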

I am confused about the part where we assume the parameter $\theta$ depends on $X$. If we compare this functional form to a linear regression, e.g. $y=xb+e$, would it not mean that the value of $b$ depends on $x$? If that is true, it would call into question whether this assumption is feasible.

Could someone help with the above question, or just help me establish the connection between ERM and MLE? Thank you in advance.

References:
Myung, I. J. (2003). Tutorial on maximum likelihood estimation. Journal of Mathematical Psychology, 47(1), 90–100.

Nielsen, D. (2016). Tree boosting with XGBoost: Why does XGBoost win "every" machine learning competition? (Master's thesis, NTNU).

Poggio, T. (2011). The Learning Problem and Regularization.

Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 988–999.

Best Answer

EDIT: I have incorporated the comments into the answer.

As you stated, "with MLE we want to maximize the likelihood of the observed data over the parameter space". The solution to the MLE problem is an estimator $\hat{\theta}$ of the parameter $\theta$, expressed as a function of the observed data $(x_i)$; that is, $\hat{\theta} \in \Theta^{\mathcal{X}}$, a function from $\mathcal{X}$ to $\Theta$.

For example, the MLE of the mean $\mu$ for normally distributed data $X\sim N(\mu,\sigma^2)$ is the sample mean $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i$.
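
A quick numerical check of this (a minimal sketch with made-up sample values, assuming $\sigma=1$ is known): minimizing the negative Gaussian log-likelihood over $\mu$ recovers exactly the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Made-up sample, assumed drawn from N(mu, 1) with sigma = 1 known
x = np.array([4.8, 5.1, 5.3, 4.9, 5.4])

def neg_log_likelihood(mu):
    """Negative Gaussian log-likelihood of the sample at mean mu."""
    return -np.sum(norm.logpdf(x, loc=mu, scale=1.0))

mu_hat = minimize_scalar(neg_log_likelihood).x
print(mu_hat, x.mean())  # both ~5.1: the MLE is the sample mean
```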

In section 2.3.4 he states: "Most model classes will have some parameters $\theta \in \Theta$ that the learning algorithm will adjust to fit the data. In this case, it suffices to estimate the parameters $\hat{\theta}$ in order to estimate the model $\hat{f} = f(x;\hat{\theta})$", which can be done with MLE. In this sense $\hat{\theta}$ depends on $X$.
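
To tie this back to the linear regression in your question (my own sketch, not from the thesis): with $y = xb + e$ and Gaussian noise, the map is $\theta(x) = bx$. The coefficient $b$ is a single constant estimated from the whole sample; only the conditional mean $\theta(x)$ varies with $x$.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Made-up data generated as y = 2*x + Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = 2.0 * x + rng.normal(scale=0.5, size=x.size)

def neg_log_likelihood(b):
    """NLL of Y|X ~ N(theta(x), 1) with theta(x) = b*x; b is one constant."""
    return -np.sum(norm.logpdf(y, loc=b * x, scale=1.0))

b_hat_mle = minimize_scalar(neg_log_likelihood).x
b_hat_erm = np.dot(x, y) / np.dot(x, x)  # least-squares (ERM) solution
print(b_hat_mle, b_hat_erm)  # the two estimates coincide, both near 2
```

The MLE and least-squares ERM estimates coincide, as the derivation in the question predicts, and $b$ itself never depends on $x$.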

Does this answer your question?
