Could anyone explain maximum likelihood estimation (MLE) to me in detail, in layman's terms? I would like to understand the underlying concept before going into the mathematical derivations and equations.
Maximum Likelihood Estimation (MLE) in Layman's Terms
Tags: definition, intuition, mathematical-statistics, maximum-likelihood, philosophical
Related Solutions
Usually, maximum likelihood is used in a parametric context. But the same principle can be used nonparametrically. For example, if you have data consisting of observations of a continuous random variable $X$, say observations $x_1, x_2, \dots, x_n$, and the model is unrestricted, that is, just saying the data come from a distribution with cumulative distribution function $F$, then the empirical distribution function $$ \hat{F}_n(x) = \frac{\text{number of observations $x_i$ with $x_i \le x$}}{n} $$ is the non-parametric maximum likelihood estimator.
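To make that concrete, here is a minimal sketch (Python with NumPy, not part of the original answer; the sample values are made up for illustration) of evaluating the empirical distribution function $\hat{F}_n$ from a sample:

```python
import numpy as np

def ecdf(sample):
    """Return the empirical distribution function F_hat of a sample.

    F_hat(x) = (number of observations <= x) / n, the nonparametric MLE of F.
    """
    sample = np.sort(np.asarray(sample, dtype=float))
    n = sample.size

    def F_hat(x):
        # side="right" counts how many sorted values are <= x
        return np.searchsorted(sample, x, side="right") / n

    return F_hat

x = [2.3, -0.7, 1.1, 0.4, 3.0]   # hypothetical observations
F_hat = ecdf(x)
print(F_hat(1.0))                # fraction of observations <= 1.0 -> 0.4
```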
This is related to bootstrapping. In bootstrapping, we are repeatedly sampling with replacement from the original sample $X_1,X_2, \dots, X_n$. That is exactly the same as taking an iid sample from $\hat{F}_n$ defined above. In that way, bootstrapping can be seen as nonparametric maximum likelihood.
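To illustrate the connection, a small sketch (again Python with made-up data): resampling with replacement from the observed sample is exactly an iid draw from $\hat{F}_n$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([2.3, -0.7, 1.1, 0.4, 3.0])   # hypothetical original sample

# One bootstrap sample: n iid draws from F_hat, i.e. sampling x with replacement
boot = rng.choice(x, size=x.size, replace=True)
print(boot)
```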
EDIT (answer to question in comments by @Martijn Weterings)
If the model is $X_1, X_2, \dotsc, X_n$ IID from some distribution with cdf $F$, without any restrictions on $F$, then one can show that $\hat{F}_n(x)$ is the mle (maximum likelihood estimator) of $F(x)$. That is done in What inferential method produces the empirical CDF? so I will not repeat it here. Now, if $\theta$ is a real parameter describing some aspect of $F$, it can be written as a function $\theta(F)$. This is called a functional parameter. Some examples are $$ \DeclareMathOperator{\E}{\mathbb{E}} \E_F X=\int x \; dF(x)\quad (\text{The Stieltjes Integral}) \\ \text{median}_F X = F^{-1}(0.5) $$ and many others. The parameter space is $$\Theta =\left\{ F \colon \text{$F$ is a distribution function on the real line } \right\}$$
By the invariance property (Invariance property of maximum likelihood estimator?) we then find mle's by $$ \widehat{\E_F X} = \int x \; d\hat{F}_n(x) \\ \widehat{\text{median}_F X}= \hat{F}_n^{-1}(0.5). $$ It should be clearer now. We don't (as you ask about) use the empirical distribution function to define the likelihood; the likelihood function is completely nonparametric, and $\hat{F}_n$ is the mle. The bootstrap is then used to describe the variability/uncertainty in the mle's of the $\theta(F)$'s of interest by resampling (which is simple random sampling from $\hat{F}_n$).
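As a concrete sketch of this plug-in idea (Python, with a hypothetical sample; not from the original answer): the mle of $\E_F X$ is the integral of $x$ against $\hat{F}_n$, i.e. the sample mean, the mle of the median is an empirical quantile, and the bootstrap describes the variability of such plug-in estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=100)   # hypothetical sample

# Plug-in (nonparametric ML) estimates of functional parameters theta(F):
mean_hat = x.mean()          # integral of t dF_hat(t): the sample mean
median_hat = np.median(x)    # F_hat^{-1}(0.5), up to the interpolation convention

# Bootstrap: resample from F_hat to describe the variability of a plug-in mle
boot_medians = np.array([
    np.median(rng.choice(x, size=x.size, replace=True))
    for _ in range(2000)
])
print(mean_hat, median_hat, boot_medians.std())   # estimates and a bootstrap SE for the median
```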
EDIT In the comment thread many seem to disbelieve this result (which really is a standard one!), so let me try to make it clearer. The likelihood function is nonparametric; the parameter is $F$, the unknown cumulative distribution function. For a given cutoff point $x \in \mathbb{R}$, a function of the parameter is $\DeclareMathOperator{\P}{\mathbb{P}} x(F)=F(x)=\P(X \le x)$. A corresponding transformation of the random variable $X$ is $I_x=\mathbb{I}(X\le x)$, which is a Bernoulli random variable with parameter $x(F)$. The maximum likelihood estimate of $x(F)$ based on the sample $I_x(X_1), \dotsc, I_x(X_n)$ is the usual fraction of $X_i$'s that are less than or equal to $x$, and the empirical cumulative distribution function expresses this simultaneously for all $x$. Hope this is clearer now!
You apply a relatively narrow definition of frequentism and MLE - if we are a bit more generous and define
Frequentism: goal of consistency, (asymptotic) optimality, unbiasedness, and controlled error rates under repeated sampling, independent of the true parameters
MLE = point estimate + confidence intervals (CIs)
then it seems pretty clear that MLE satisfies all frequentist ideals. In particular, CIs in MLE, like p-values, control the error rate under repeated sampling, and do not give a 95% probability region for the true parameter value, as many people think - hence they are through and through frequentist.
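As a quick illustration of "controlling the error rate under repeated sampling" (a Python sketch with made-up true parameter values, not part of the original answer): simulate many datasets, build the usual Wald CI around the MLE each time, and check how often the interval covers the true value.

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma_true, n, reps = 10.0, 3.0, 100, 5000   # hypothetical "true" values
covered = 0

for _ in range(reps):
    x = rng.normal(mu_true, sigma_true, size=n)
    mu_hat = x.mean()                  # MLE of the mean
    se_hat = x.std() / np.sqrt(n)      # plug-in SE using the MLE of sigma (ddof=0)
    lo, hi = mu_hat - 1.96 * se_hat, mu_hat + 1.96 * se_hat   # 95% Wald CI
    covered += (lo <= mu_true <= hi)

print(covered / reps)   # close to 0.95: the error rate is controlled over repeated samples
```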
Not all of these ideas were already present in Fisher's foundational 1922 paper "On the mathematical foundations of theoretical statistics", but the ideas of optimality and unbiasedness are, and Neyman later added the idea of constructing CIs with fixed error rates. Efron, 2013, "A 250-year argument: Belief, behavior, and the bootstrap", summarizes in his very readable history of the Bayesian/Frequentist debate:
The frequentist bandwagon really got rolling in the early 1900s. Ronald Fisher developed the maximum likelihood theory of optimal estimation, showing the best possible behavior for an estimate, and Jerzy Neyman did the same for confidence intervals and tests. Fisher’s and Neyman’s procedures were an almost perfect fit to the scientific needs and the computational limits of twentieth century science, casting Bayesianism into a shadow existence.
Regarding your narrower definition - I mildly disagree with your premise that minimization of frequentist risk (FR) is the main criterion for deciding whether a method follows frequentist philosophy. I would say the fact that minimizing FR is a desirable property follows from frequentist philosophy, rather than preceding it. Hence, a decision rule / estimator does not have to minimize FR to be frequentist, and minimizing FR also does not necessarily mean that a method is frequentist, but a frequentist would, when in doubt, prefer minimization of FR.
If we look at MLE specifically: Fisher showed that MLE is asymptotically optimal (broadly equivalent to minimizing FR), and that was certainly one reason for promoting MLE. However, he was aware that optimality does not hold for finite sample sizes. Still, he was happy with this estimator due to other desirable properties such as consistency, asymptotic normality, invariance under parameter transformations, and let's not forget: ease of calculation. Invariance in particular is stressed abundantly in the 1922 paper - from my reading, I would say that maintaining invariance under parameter transformation, and the ability to get rid of priors in general, were among his main motivations for choosing MLE. If you want to understand his reasoning better, I really recommend the 1922 paper; it's beautifully written and he explains his reasoning very well.
Best Answer
Say you have some data. Say you're willing to assume that the data comes from some distribution -- perhaps Gaussian. There are an infinite number of different Gaussians that the data could have come from (which correspond to the combination of the infinite number of means and variances that a Gaussian distribution can have). MLE will pick the Gaussian (i.e., the mean and variance) that is "most consistent" with your data (the precise meaning of consistent is explained below).
So, say you've got a data set of $y = \{-1, 3, 7\}$. The most consistent Gaussian from which that data could have come has a mean of 3 and a variance of 32/3 ≈ 10.7 (the maximum likelihood estimates). It could have been sampled from some other Gaussian. But one with a mean of 3 and variance of 32/3 is most consistent with the data in the following sense: the probability of getting the particular $y$ values you observed is greater with this choice of mean and variance than it is with any other choice.
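A short sketch of this (Python with NumPy/SciPy, not part of the original answer): the closed-form Gaussian MLEs are the sample mean and the n-denominator variance, and any other (mean, variance) pair makes the observed $y$ less likely.

```python
import numpy as np
from scipy.stats import norm

y = np.array([-1.0, 3.0, 7.0])

# Closed-form Gaussian MLEs: the sample mean and the n-denominator variance
mu_hat = y.mean()    # 3.0
var_hat = y.var()    # 32/3 ~= 10.67 (np.var uses ddof=0, i.e. the ML estimate)

def loglik(mu, var):
    return norm.logpdf(y, loc=mu, scale=np.sqrt(var)).sum()

# The MLE makes the observed y at least as probable as any other choice
print(loglik(mu_hat, var_hat))   # the maximal value
print(loglik(3.0, 16.0))         # smaller (16 is the n-1 sample variance, not the MLE)
print(loglik(0.0, 10.0))         # smaller still
```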
Moving to regression: instead of the mean being a constant, the mean is a linear function of the data, as specified by the regression equation. So, say you've got data like $x = \{ 2,4,10 \}$ along with $y$ from before. The mean of that Gaussian is now the fitted regression model $X'\hat\beta$, where $\hat\beta =[-1.9,.9]$
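A minimal sketch using those numbers (Python, not part of the original answer): with Gaussian errors, maximizing the likelihood over $\beta$ is the same as least squares, which recovers $\hat\beta \approx [-1.9, 0.9]$.

```python
import numpy as np

x = np.array([2.0, 4.0, 10.0])
y = np.array([-1.0, 3.0, 7.0])

# Design matrix with an intercept column; under Gaussian errors the MLE of beta
# is the least-squares solution
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)        # approximately [-1.92, 0.92], i.e. the [-1.9, 0.9] above
print(X @ beta_hat)    # the fitted Gaussian mean for each observation
```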
Moving to GLMs: replace the Gaussian with some other distribution (from the exponential family). The mean is now a linear function of the data, as specified by the regression equation, transformed by the inverse of the link function. So, it's $g(X'\beta)$, where $g(\eta) = e^\eta/(1+e^\eta)$ is the inverse of the logit link (with binomial data).
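A small sketch of GLM fitting by maximum likelihood (Python with SciPy; the 0/1 data here are made up for illustration and not from the original answer): write down the Bernoulli log-likelihood with the inverse-logit mean and maximize it numerically.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical 0/1 responses and a design matrix with an intercept
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 1, 0, 1, 1])
X = np.column_stack([np.ones_like(x), x])

def neg_loglik(beta):
    eta = X @ beta                        # linear predictor X'beta
    mu = 1.0 / (1.0 + np.exp(-eta))       # inverse logit: the modelled mean
    mu = np.clip(mu, 1e-12, 1 - 1e-12)    # numerical safety
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# MLE: maximize the Bernoulli likelihood by minimizing its negative log
beta_hat = minimize(neg_loglik, x0=np.zeros(2)).x
print(beta_hat)
print(1.0 / (1.0 + np.exp(-(X @ beta_hat))))   # fitted probabilities g(X'beta_hat)
```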