Goodness of Fit – Evaluating a Distribution Fit Using Log-Loss Function Minimization

Tags: censoring, goodness of fit, lognormal distribution, maximum likelihood, survival

I am trying to fit a log-normal distribution to time-to-failure data for a product, but the data are not in the usual form. Every row $i$ carries two pieces of information about the time to failure:

  1. The time ($t_i$) for which item $i$ was observed.
  2. A binary variable $y_i\in \{0,1\}$ indicating whether the item had failed before time $t_i$ (1 if failed, 0 otherwise).

So, the data looks something like this.

| Observed time ($t_i$) | Failed? ($y_i$) |
|---|---|
| 5 hrs | 1 |
| 6 hrs | 0 |
| 5 hrs | 0 |
| 7 hrs | 1 |

I emphasize that the time $t_i$ is not the actual failure time. If item $i$ fails ($y_i=1$), $t_i$ is just an upper limit of the failure time. If $i$ doesn't fail ($y_i=0$) then $t_i$ is the lower limit of the failure time.

Now, if we assume all the items are identical and their time to failure is a random variable $\hat t$ that follows a log-normal distribution with parameters $\alpha = (\mu, \sigma)$, then the probability that product $i$ fails before its observation time $t_i$ is $p_i = \Pr(\hat t<t_i) = F(t_i; \mu, \sigma)$, where $F(\cdot)$ is the CDF of the log-normal distribution.

So, for every observation $i$, using the probability of failure $p_i$ and the actual label $y_i$ denoting whether the product failed, we can write its log-likelihood contribution as
$$y_i \ln(p_i) + (1-y_i)\ln(1-p_i) = y_i \ln(F(t_i; \mu, \sigma)) + (1-y_i)\ln(1-F(t_i; \mu, \sigma))$$
Note: This is similar to the loss function of binary logistic regression.

We can define the log-loss for the whole data set ($n$ data points) as
$$-\sum_{i=1}^n \left[ y_i \ln(F(t_i; \mu, \sigma)) + (1-y_i)\ln(1-F(t_i; \mu, \sigma)) \right]$$

We can minimize this loss function over $\mu, \sigma$ to obtain estimates of $\mu$ and $\sigma$.
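A minimal sketch of this minimization in Python, assuming NumPy and SciPy are available. The function names `log_loss` and `fit_lognormal`, the Nelder-Mead optimizer, the starting point, and the synthetic data are all illustrative choices, not part of the original problem. It uses the fact that $F(t; \mu, \sigma) = \Phi\big((\ln t - \mu)/\sigma\big)$, with $\Phi$ the standard normal CDF.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def log_loss(params, t, y):
    """Negative log-likelihood of (t_i, y_i) under a log-normal failure time."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)        # optimise log(sigma) so sigma stays positive
    z = (np.log(t) - mu) / sigma     # F(t; mu, sigma) = Phi(z)
    # ln F for failed items, ln(1 - F) for surviving items (y is binary)
    ll = np.where(y == 1, norm.logcdf(z), norm.logsf(z))
    return -np.sum(ll)


def fit_lognormal(t, y):
    """Return (mu_hat, sigma_hat, minimised loss)."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    start = np.array([np.mean(np.log(t)), 0.0])  # crude starting point (assumption)
    res = minimize(log_loss, start, args=(t, y), method="Nelder-Mead")
    return float(res.x[0]), float(np.exp(res.x[1])), float(res.fun)


# Synthetic data for illustration only: true mu = 1.5, sigma = 0.5
rng = np.random.default_rng(0)
n = 2000
obs_time = rng.uniform(1.0, 12.0, size=n)
fail_time = rng.lognormal(mean=1.5, sigma=0.5, size=n)
failed = (fail_time < obs_time).astype(float)

mu_hat, sigma_hat, loss = fit_lognormal(obs_time, failed)
print(mu_hat, sigma_hat, loss)
```

With only a handful of rows, such as the four shown in the table, the estimates can be poorly determined, so the sketch is run on simulated data instead.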

But after finding $\mu, \sigma$, how do I know that the log-normal distribution is a good fit? I don't have the actual failure times, so I can't do something like a KS test.

Best Answer

This is double censoring. You only have an upper limit to the failure times for items with $y_i=1$ (left censored) and a lower limit for those with $y_i=0$ (right censored).

Although you don't have exact failure times, you still can get a nonparametric estimate of the survival curve $S(t)=1-F(t)$ against which to compare your lognormal model. Turnbull proposed an iterative approach in the Journal of the American Statistical Association 69: 169-173 (1974) that takes advantage of the information provided by both the left-censored and right-censored cases, as well as any having known event times.

I don't work with such data, but you can try the tools provided by the interval and icenReg packages in R. Although the names of the packages suggest that they are for interval-censored data (for which you know a time interval during which the event occurred), these notes point out:

> Double censored data have some observations left censored, some right censored, and some exact. In a way, this is a subset of interval censored data with the left side of the interval for left-censored observations and the right side of the interval for right-censored observations written as NA or ±Inf.
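The coding described in that passage might look like the following Python sketch. The helper name `to_intervals` is hypothetical, and using 0 as the lower bound for left-censored rows is an assumption based on failure times being positive; the exact convention (NA, 0, or ±Inf) depends on the software you feed the intervals to.

```python
import numpy as np


def to_intervals(t, y):
    """Code each (t_i, y_i) as an interval (left_i, right_i] holding the failure time."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=int)
    left = np.where(y == 1, 0.0, t)       # failed by t_i: failure lies in (0, t_i]
    right = np.where(y == 1, t, np.inf)   # not yet failed: failure lies in (t_i, inf)
    return left, right


# The four rows from the question
left, right = to_intervals([5, 6, 5, 7], [1, 0, 0, 1])
print(np.column_stack([left, right]))
```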

With that data coding you can use software for interval-censored data to get a nonparametric survival curve, if you don't want to implement the Turnbull method yourself. Those notes illustrate that approach with data from the text by Klein and Moeschberger, maybe the best source for explaining how to deal with all combinations of truncation and censoring in survival analysis.
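If you do want to try it yourself, here is a rough Python sketch of a self-consistency (EM) iteration in the spirit of Turnbull's method. It is a simplification, not a faithful reimplementation: it puts probability mass on every cell between consecutive interval endpoints instead of only Turnbull's innermost intervals, and it does not handle exact event times. The function name `turnbull_npmle` and its parameters are hypothetical.

```python
import numpy as np


def turnbull_npmle(left, right, tol=1e-8, max_iter=10000):
    """Nonparametric estimate of F from intervals (left_i, right_i].

    Returns the finite cut points and the estimated F at each of them.
    Assumes left_i < right_i for every row (no exact event times).
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    # Cells (lo_j, hi_j] between consecutive distinct endpoints
    cuts = np.unique(np.concatenate([[0.0], left, right, [np.inf]]))
    lo, hi = cuts[:-1], cuts[1:]
    # A[i, j] = 1 when cell j lies inside observation i's interval
    A = ((lo[None, :] >= left[:, None]) & (hi[None, :] <= right[:, None])).astype(float)
    p = np.full(len(lo), 1.0 / len(lo))   # start with equal mass on every cell
    for _ in range(max_iter):
        w = A * p                               # mass each cell offers each row
        w /= w.sum(axis=1, keepdims=True)       # E-step: share each row across its cells
        p_new = w.mean(axis=0)                  # M-step: average the shares
        converged = np.max(np.abs(p_new - p)) < tol
        p = p_new
        if converged:
            break
    F = np.cumsum(p)
    return hi[:-1], F[:-1]                      # drop the final cell ending at infinity


# The four rows from the question, coded as (left, right] intervals
times, F_hat = turnbull_npmle([0, 6, 5, 0], [5, np.inf, np.inf, 7])
print(times, F_hat)
```

For these four rows the estimate works out to $F(5) = F(6) = 1/3$ and $F(7) = 1$, which is the maximizer of $F(5)\,(1-F(6))\,(1-F(5))\,F(7)$ under monotonicity. On a real data set, plotting the returned step function against the fitted log-normal CDF gives a visual check of fit.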
