Maximum Likelihood Estimation – Understanding MLE and Its Confidence Intervals

confidence-interval, maximum-likelihood, nonlinear-regression, probit

I'm trying to figure out whether I actually understand MLE correctly, or at least whether I am applying it correctly to my data. My data consist of several patients, and for each patient I have measurements that enter the equations below.

My problem is to find the best-fit parameters (three of them) of a model through MLE for my data. From what I can read in the literature, for my type of problem the log-likelihood function is:

$$ LLH (D_{50}, m, n) = \sum\limits_{y(i)=1}\log(NTCP(D_{50}, m, n)) + \sum\limits_{y(i)=0}\log(1- NTCP(D_{50}, m, n)) $$

Where $NTCP$ and $t$ are given by:

$$ NTCP = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-\frac{x^2}{2}} dx$$

$$ t = \frac{gEUD - D_{50}}{m\times D_{50}}$$

$D_{50}$ can in principle take any value from 0 to infinity, but the optimal value is most likely in the range 0-150. $m$ is probably in the range 0-3, and $gEUD$ is calculated from the equation below.

$$ gEUD = \left(\sum\limits_{i} v_i d_{i}^{\frac{1}{n}}\right)^{n} $$

As stated, I have several patients, and for each patient I have a two-column list of values that supplies the $v_i$ and $d_i$ in the above equation to get the $gEUD$. Each patient is also associated with a value of 1 or 0, which tells me whether this patient has had any toxicity from treatment (1 = yes, 0 = no). This label is what assigns each patient to one of the two sums in the $LLH$ function.
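
For reference, here is a minimal sketch of how I compute the gEUD and NTCP for one patient, assuming the two columns are available as arrays `v` and `d` (the function names are just for illustration):

```python
import numpy as np
from scipy.stats import norm

def geud(v, d, n):
    """Generalized equivalent uniform dose for one patient.
    v, d are the two columns (volume fractions and doses) of that patient."""
    v = np.asarray(v, dtype=float)
    d = np.asarray(d, dtype=float)
    return np.sum(v * d ** (1.0 / n)) ** n

def ntcp(v, d, D50, m, n):
    """NTCP = Phi(t) with t = (gEUD - D50) / (m * D50)."""
    t = (geud(v, d, n) - D50) / (m * D50)
    return norm.cdf(t)   # standard normal CDF, same as scipy.stats.norm(0, 1).cdf(t)
```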

So what I have done so far is to calculate the NTCP for all patients (as seen in the equation, it is a sigmoid function, given in Python by scipy.stats.norm(0, 1).cdf(t), where t carries all the parameters). I then create a grid of parameters (D50, m, and n), say a (100, 100, 100) grid, which means 1 million NTCP values calculated for each patient (different parameter sets lead to different NTCPs).

In my case the NTCP can come out as exactly 0 or 1 (0% to 100% probability of getting toxicity), which means that wherever I have log(0) I get infinity. So in my Python code I check for 0's and 1's among the 1 million calculations for each patient, combine these indices (of 0's and 1's) across patients, and remove them from the parameter grid for all patients. As a result I lose some parameter sets, since they cannot be evaluated otherwise.
After that I take index (0, 0, 0) for all patients, split them according to the LLH equation (y(i) = 1 if they have toxicity, y(i) = 0 if they don't), and get an LLH for that one parameter set. This is then done for the entire grid over all patients, and I end up with approximately 1 million LLH values (minus the ones that have been removed due to the possible log(0)). I then find the maximum value, and that is supposed to be my MLE, right?
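
A stripped-down sketch of this grid calculation, assuming the patients are stored as a list of `(v, d, y)` tuples; note that this version clips the NTCP away from exactly 0 and 1 instead of removing those grid points, which is a slight variation on what I described:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data layout: patients = [(v_i, d_i, y_i), ...]
def grid_llh(patients, D50_grid, m_grid, n_grid, eps=1e-12):
    """Log-likelihood on a (len(D50_grid), len(m_grid), len(n_grid)) grid.
    NTCP is clipped away from exactly 0 and 1 instead of removing those points."""
    D50 = np.asarray(D50_grid, float)[:, None, None]   # shape (nD, 1, 1)
    m = np.asarray(m_grid, float)[None, :, None]       # shape (1, nm, 1)
    llh = np.zeros((len(D50_grid), len(m_grid), len(n_grid)))
    for v, d, y in patients:
        v, d = np.asarray(v, float), np.asarray(d, float)
        # gEUD for every n on the grid -> shape (nn,)
        geud_n = np.array([np.sum(v * d ** (1.0 / n)) ** n for n in n_grid])
        t = (geud_n[None, None, :] - D50) / (m * D50)   # broadcasts to (nD, nm, nn)
        p = np.clip(norm.cdf(t), eps, 1 - eps)
        llh += np.log(p) if y == 1 else np.log(1 - p)
    return llh

# The MLE is the grid point with the largest log-likelihood:
# i, j, k = np.unravel_index(np.argmax(llh), llh.shape)
# D50_hat, m_hat, n_hat = D50_grid[i], m_grid[j], n_grid[k]
```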

I have no idea whether there is a built-in Python function/package that can do this, but in any case I have written it myself. Obviously it takes some time to calculate 1 million values for each of the patients, but I guess that is how it is.

My problem, however, is that I have data for which the least-squares method has been used to find the best-fit parameters, and if I apply this MLE method to the same data, I don't get quite the same result. Should that worry me, or…?

Also, for finding confidence intervals I have read that the profile log-likelihood should be the way to go, but also bootstrapping. What would be recommended in my case, where I actually end up with a list of approximately 1 million LLH values?

Best Answer

This is not a complete answer to all your various questions, but it should help you get further with your model.

Probit model

What you are describing

$\log \mathcal{L}(D_{50},m,n|y_i) = \sum\limits_{y(i)=1}\log(P(D_{50}, m, n)) + \sum\limits_{y(i)=0}\log(1- P(D_{50}, m, n))$

is a probit model.

Note that

$$\frac{1}{\sqrt{2 \pi}} \int_{-\infty}^t e^{-\frac{x^2}{2}}dx = \Phi(t)$$

is the cumulative distribution function of the standard normal distribution.

Your log-likelihood corresponds to a Bernoulli distribution for each individual $Y_i$:

$$P(Y_i=1|t_i) = 1 - P(Y_i=0|t_i) = \Phi(t_i) $$


Faster estimation method by solving the linear part separately

Note that if you keep $n$ fixed, then $t$ is linear in the remaining parameters:

$$t_i = \frac{x_i - D_{50}}{mD_{50}} = ax_i + b$$

where $x_i = gEUD_i(n)$, $a = \frac{1}{mD_{50}}$, $b = -\frac{1}{m}$.

Now you can put your model in a very standard (GLM) form for which many solution methods (and standard software) exist:

$$P(Y_i=1|t_i) = 1 - P(Y_i=0|t_i) = \Phi(ax_i + b) $$

and

$$f(y_i|x_i) = y_i \Phi(ax_i + b) + (1-y_i) (1-\Phi(ax_i + b))$$

Then find which $n$, $a$, and $b$ maximize the likelihood function; you can compute your estimates of $m$ and $D_{50}$ from the estimates of $a$ and $b$.

This approach (a grid search over $n$ only, with a standard probit fit for $a$ and $b$ at each $n$) is a faster way to obtain the MLE than searching a 100×100×100 grid.
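
A minimal sketch of this approach, assuming the data are available as a list of `(v_i, d_i, y_i)` tuples and using the standard `Probit` model from `statsmodels` (the function name and the grid over $n$ are just for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data layout: patients = [(v_i, d_i, y_i), ...]
def fit_probit_over_n(patients, n_grid):
    """For each fixed n, fit a standard probit model y ~ Phi(a * gEUD(n) + b),
    keep the n with the highest log-likelihood, then back-transform to (D50, m)."""
    y = np.array([p[2] for p in patients])
    best = None
    for n in n_grid:
        x = np.array([np.sum(np.asarray(v, float) * np.asarray(d, float) ** (1.0 / n)) ** n
                      for v, d, _ in patients])        # gEUD_i(n) for every patient
        X = sm.add_constant(x)                         # columns: intercept b, slope a
        res = sm.Probit(y, X).fit(disp=0)
        if best is None or res.llf > best[0]:
            best = (res.llf, n, res.params)
    llf, n_hat, (b, a) = best
    # a = 1/(m*D50), b = -1/m  =>  m = -1/b, D50 = -b/a
    # (assumes b < 0 and a > 0, i.e. toxicity risk increases with dose)
    m_hat = -1.0 / b
    D50_hat = -b / a
    return D50_hat, m_hat, n_hat, llf
```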


  • I then find the maximum value, and that is supposed to be my MLE, right? Your grid search is valid (as long as you check that the maximum is not on the edge of the grid), but it is not very fast or scalable.

    Possibly you could use a non-linear optimization method to solve the model without a grid search (a minimal sketch is given after this list). I do not know of such methods that are specialized for probit models.

  • I have data where the least square method has been used to find the best fitted parameters, and if I do this MLE method on the same data, I don't quite get the same result.

    It is difficult to say exactly what is wrong without knowing the details of the data and the least-squares method. But it is very reasonable to expect that the MLE method you use will differ from least squares. With a probit model you assume that the data follow a Bernoulli distribution; with a least-squares model you assume that the data follow a Gaussian distribution. These are different things, and in particular the Gaussian distribution (which does not model the discrete values 0 or 1) is not very logical/correct to use here, so its results will likely deviate from those of a probit model.

  • for finding confidence intervals ... profile log likelihood ... but also bootstrapping. What would be recommended

    Neither method will give an exact solution, so this is a matter of how you weigh the pros and cons.

    The profile likelihood is approximately correct near the MLE and will do better when the interval does not extend too far from the center, and when the interaction between the parameters is not too strong. You can check this on plots of the profiles.

    Bootstrapping is not exact either, but you can make it as close to the correct CI as you wish given enough computation time (so the question here is how much time and computing power you have). A bootstrap sketch, built on a direct non-linear optimization of the log-likelihood, follows below.
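
To illustrate both points (direct non-linear optimization instead of a full grid, and a percentile bootstrap for confidence intervals), here is a minimal sketch, again assuming the `(v_i, d_i, y_i)` data layout; the Nelder-Mead starting point, the clipping constant, and the number of bootstrap replicates are assumptions:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Hypothetical data layout: patients = [(v_i, d_i, y_i), ...]
def neg_llh(params, patients, eps=1e-12):
    """Negative log-likelihood in (D50, m, n); invalid parameter values get +inf."""
    D50, m, n = params
    if D50 <= 0 or m <= 0 or n <= 0:
        return np.inf
    total = 0.0
    for v, d, y in patients:
        geud = np.sum(np.asarray(v, float) * np.asarray(d, float) ** (1.0 / n)) ** n
        p = np.clip(norm.cdf((geud - D50) / (m * D50)), eps, 1 - eps)
        total += np.log(p) if y == 1 else np.log(1 - p)
    return -total

def fit_mle(patients, start=(50.0, 0.5, 0.5)):
    """Direct MLE with Nelder-Mead; the starting point is an assumption."""
    res = minimize(neg_llh, start, args=(patients,), method="Nelder-Mead")
    return res.x                                  # (D50_hat, m_hat, n_hat)

def bootstrap_ci(patients, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample patients with replacement and refit each time."""
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(patients), size=len(patients))
        fits.append(fit_mle([patients[i] for i in idx]))
    fits = np.array(fits)
    lo, hi = 100 * alpha / 2, 100 * (1 - alpha / 2)
    return np.percentile(fits, [lo, hi], axis=0)  # rows: lower/upper; columns: D50, m, n
```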
