I am trying to understand the effect of censored data on parameter estimation, in particular with respect to Bayesian modeling. Here is an illustrating example with some code:
Assume you have $N = n+m$ i.i.d observations from an exponential distribution with parameter $\lambda=3$. We will also define a value $L$ such that any observation less than $L$ is censored. With this in mind, the following is the likelihood for the data
\begin{align}
\ell(x|\lambda) &= \prod_{i=1}^mF(x)\prod_{i=1}^nf(x)\\
&=\prod_{i=1}^m(1-e^{-\lambda L})\prod_{i=1}^n\lambda e^{-\lambda x_i}
\end{align}
where $m$ and $n$ are the number of censored and noncensored observations.
We also place a prior such that $\lambda\sim\text{Gamma}(0.01,0.01)$. Now, no matter what the level of censoring is, as long as I have a single uncensored data point my models seems to do well recovering the true $\lambda$. However, as soon as I make the entire data set censored, I cannot estimate $\lambda$ well. This doesn't seem too shocking to me given that you don't have any information other than the censoring point, however, I would have thought that the estimate of $\lambda$ would be whatever the prior mean is, which is not the case. Should you be recovering the prior if you have all censored data?
Here is come code that estimates $\lambda$ in an MCMC. To change the percent of censoring you just need to change the value for prob
in the quantile()
function.
likelihood <- function(x, censor, lambda){
llik1 <- log(1 - exp(-lambda * x[censor == 1]))
llik2 <- log(lambda) - lambda * x[censor == 0]
lik <- sum(llik1) + sum(llik2)
return(lik)
}
prior <- function(lambda, sigma){
prior <- dgamma(lambda, .01, .01, log = TRUE)
return(prior)
}
posterior <- function(x, censor, lambda){
post <- likelihood(x, censor, lambda) + prior(lambda)
return(post)
}
set.seed(4)
N = 100
x = rexp(N, 3)
L = quantile(x, prob = 1) # Censoring point
censor = ifelse(x <= L, 1, 0) # Censoring indicator
x[censor == 1] <- L
S <- 10000
lambda <- rep(NA, S)
lambda[1] <- 1
for(i in 2:S){
# MCMC for lambda
lambda.star = lambda[i-1] + rnorm(1,0)
if(lambda.star < 0){
alpha = 0
}else{
ratio <- exp(posterior(x, censor, lambda.star) - posterior(x, censor, lambda[i-1]))
alpha <- min(1, ratio)
}
if(runif(1) < alpha){
lambda[i] <- lambda.star
}else{
lambda[i] <- lambda[i - 1]
}
}
lambda <- lambda[5000:S]
mean(lambda)
Best Answer
It is not the case that you should recover the prior if all your data is censored; there is still information in the data.
Consider the following situation. The data is i.i.d Gaussian with unknown mean $\mu$ (whether or not we know the variance is irrelevant to this example.) My prior on the mean is that $\mu \sim N(0,10)$. My censoring point is $\gamma = 10$. I draw 100 observations; every one is above the censoring point. Without getting into the details of what my posterior should look like, why, given that I have observed 100 values $> 10$ and none $< 10$, and the underlying distribution is Normal, would I hold to the belief that there's a 50-50 chance that the true mean $< 0$ (which is a straightforward implication of the prior)? For any proposed value of $\mu \leq 10$, there is only a $2^{-100}$ probability that all 100 observations would lie above $\mu$ (thanks to the symmetry of the Gaussian); clearly our posterior should be substantially different than our prior if we want to capture this information.