Probability – Computing Relative Likelihoods of Sequences in Poisson Process

poisson-distribution, poisson-process, probability

I have a problem where I need to compute the likelihood of a given sequence of event times under a Poisson process with a particular lambda (events per second). I only need it up to a constant, for comparison with other candidate sequences; a likelihood ratio, for example:

p(sequence_1 | lambda) / p(sequence_2 | lambda)

These probabilities could be computed from the Poisson distribution over the full time span. For example, if the whole interval is 10 seconds, I could evaluate the Poisson probability of the total number of events using mean 10*lambda.

However, I could also break the full interval down into subsets, such as 1-second intervals. I think the total probability for a sequence ought to be related to the product of the individual Poisson probabilities for each sub-interval, but I suspect I am missing a normalizing factor. Furthermore, at an intuitive level, if all the events were bunched up in time, the sequence could appear likely when analyzed over the full span yet unlikely when examined sub-interval by sub-interval.

Is there a way to do this where the result is not a function of the intervals used for processing?
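To illustrate the normalizing factor I suspect I am missing, here is a small R check (the lambda, event times, and partition are all made up). It compares the log-density of the exact event times, n*log(lambda) - lambda*T, with the log of the product of sub-interval count probabilities:

lambda <- 3; T.len <- 10
times <- c(0.7, 1.2, 1.3, 4.8, 9.9)   # made-up event times in [0, T.len]
n <- length(times)

# Log-density of the exact event times under the homogeneous process:
loglik.full <- n * log(lambda) - lambda * T.len

# Log of the product of sub-interval count probabilities for an
# arbitrary partition, plus a partition-dependent correction term:
breaks <- c(0, 2, 5, 10)              # any partition of [0, T.len]
widths <- diff(breaks)
counts <- tabulate(cut(times, breaks, labels = FALSE), length(widths))
loglik.split <- sum(dpois(counts, lambda * widths, log = TRUE)) +
  sum(lgamma(counts + 1) - counts * log(widths))

loglik.full - loglik.split            # 0 for every choice of breaks

The correction term, prod(factorial(counts) / widths^counts), appears to be exactly the normalizing factor in question: without it, the product of count probabilities depends on the partition; with it, everything collapses to lambda^n * exp(-lambda*T), which does not.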

EDIT: For clarification, my goal is to check a sequence against the Poisson process and see how well it matches. I need the score to be in units of probability, such that if the score is twice as high, the sequence is twice as likely to have come from the specified Poisson process.

Best Answer

Consider a chi-squared test.

Before examining the data, you may partition the time interval (which we can, by establishing a suitable origin and unit of time, assume to be $(0,1]$) into $k\ge 2$ subintervals $\mathcal{I}_j=(t_{j-1}, t_{j}]$ with $0=t_0 \lt t_1 \lt \cdots \lt t_k = 1$ and count the numbers of events in each subinterval.

Under the null hypothesis of a homogeneous Poisson process of intensity $\lambda,$ the expected number of events in $\mathcal I_j$ is $\lambda(t_{j} - t_{j-1}).$ Writing $n$ for the total number of events observed, the maximum likelihood estimator of $\lambda$ is $n$ itself. The alternative hypothesis of interest is any departure from this condition of homogeneity. Thus, this situation conforms exactly to the description of the chi-squared test at https://stats.stackexchange.com/a/17148/919. The reference chi-squared distribution for computing the p-value has $k-1$ degrees of freedom because one parameter has been estimated.
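Explicitly, writing $O_j$ for the observed count in $\mathcal I_j$ and $E_j = n(t_j - t_{j-1})$ for its expected value (using the estimate $\hat\lambda = n$), the test statistic is

$$X^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j},$$

to be compared against the $\chi^2_{k-1}$ distribution.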

Given an array x of the observed times in $(0,1]$ of any continuous point process, here is an R function to carry out the chi-squared test.

PPGOF <- function(x, k, ...) {
  n <- length(x)
  if (missing(k)) k <- ceiling(sqrt(n))   # Default: about sqrt(n) intervals
  if (isTRUE(n < 5*k)) k <- ceiling(n/5)  # Keep expected counts at 5 or more
  if (isTRUE(k < 2)) k <- 2               # But use at least two intervals
  chisq.test(tabulate(ceiling(x*k), k), ...)  # Counts in k equal-width bins
}

The variable k is the number of equal-width subintervals to create. (I leave coding the more general test, with variable length intervals, to anyone who needs this: the code is fussier but not essentially different.) This solution merely tabulates the counts within each subinterval and passes that summary on to the built-in chisq.test function. The result is an object that includes the chi-squared test statistic and its p-value, which can be extracted via $p.value (as shown in the code at the end of this post).
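For anyone who does need the variable-width version, here is a minimal sketch (the name PPGOF.var, and the convention that the breakpoints t run from $0$ to $1$, are mine, not part of the original function):

# Chi-squared test with user-supplied breakpoints t, where t[1] = 0 and
# t[length(t)] = 1: expected counts are proportional to interval widths.
PPGOF.var <- function(x, t, ...) {
  counts <- tabulate(cut(x, t, labels = FALSE), length(t) - 1)
  chisq.test(counts, p = diff(t), ...)
}

Note that cut bins into intervals of the form $(t_{j-1}, t_j]$ by default, matching the definition of $\mathcal I_j$ above.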

How effective is this? Here is an example with $n=100$ events. The top row displays (a) the (constant) intensity function, plotting the relative rate $\lambda(x)$ against the time $x;$ (b) one realization of this process; and (c) the sampling distribution of chi-squared p-values (based on ten thousand independent realizations). Because this is the null hypothesis, ideally that distribution is uniform. Clearly it approximates uniformity closely.

[Figure: two rows of three panels. Top row (homogeneous process): the constant intensity function, one realization, and the near-uniform sampling distribution of p-values. Bottom row (non-homogeneous process): the same three displays.]

The bottom row is a similar array of displays for a non-constant intensity function: that is, a non-homogeneous Poisson process. To my eye, at least, there is no qualitative difference between the illustrated realization of this process and the preceding realization of a homogeneous process: evidently, the difference is subtle. Nevertheless, as the bottom right plot shows, the sampling distribution of p-values is markedly different: the p-values tend to be smaller than before, with their distribution function rising well above the uniform reference line in the low range. In other words, this test has appreciable power to detect that this process is not homogeneous.

Comments

This reference chi-squared distribution will work well for large $k$ and/or large $n.$ For small $k,$ to rely on the chi-squared distribution for computing p-values, you need to respect the standard rule of thumb that all of the expected counts are $5$ or greater. If this is not possible, you can resort (as is usual) to permutation testing. For small significance levels and smallish $n$ you will discover that the discreteness of the sampling distribution of the statistic can make the test more conservative than desired, so make sure to study this possibility before proceeding in such cases.
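Since PPGOF forwards its extra arguments to chisq.test, one convenient route to a simulation-based p-value (a close relative of the permutation test just mentioned) is the built-in Monte Carlo option; here B is the number of simulated tables:

PPGOF(x, simulate.p.value = TRUE, B = 9999)$p.value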

If you anticipate that certain forms of inhomogeneity might occur, it is best to partition time into intervals with approximately equal expected counts. The flexibility afforded by choosing the numbers and widths of these intervals makes the chi-squared test especially appropriate and powerful in this application. However, one thing you do not want to do is experiment with choices of $k$ and the $t_j:$ that form of "data snooping" would yield unrealistically small p-values. It would be OK, though, to use a preliminary dataset (perhaps a random subset of the data) to choose $k$ and the $t_j$ and then apply those to the rest of the data.
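As a sketch of that last idea (the helper PPGOF.var from above, and the particular split and quantile rule here, are only for illustration):

# Choose breakpoints on a random half of the data (aiming for roughly
# equal expected counts), then test the held-out half with them fixed.
set.seed(42)
pilot <- sample(seq_along(x), length(x) %/% 2)
t.chosen <- c(0, quantile(x[pilot], probs = c(0.2, 0.4, 0.6, 0.8)), 1)
PPGOF.var(x[-pilot], unname(t.chosen))$p.value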

Code example

The following code creates the figures and illustrates how PPGOF can be used.

# Inhomogeneous Poisson process.
# f: [0,1] --> R will be reduced modulo 1.
rPP <- function(n, f=identity) {
  x <- cumsum(rexp(n+1))          # Use exponential waiting times
  f(x[-(n+1)] / x[n+1]) %% 1      # Apply `f` to these homogeneous Poisson events
}
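
# The plotting loop below calls PP.plot, which this post does not define;
# here is a minimal stand-in that draws the event times as tick marks.
PP.plot <- function(x, ...) {
  plot(x, rep(0, length(x)), ylim = c(-1, 1), xlim = 0:1, yaxt = "n",
       ylab = "", xlab = "x", pch = "|", bty = "n", ...)
}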

# Describe the simulation study.
n <- 100     # Sample size to simulate
n.sim <- 1e4 # Number of iterations
hypotheses <- list(`0` = identity,                          # Null: homogeneous process
                   `A` = function(x) x * (25 + sin(4*pi*x)))# Alternative process

# Carry out the study.
set.seed(17)
mai <- par("mai"); par(mfrow=c(length(hypotheses), 3), mai=c(.5, .4, .5, .1))
for (h in names(hypotheses)) {
  # Compute and plot the intensity. This works for increasing functions f:[0,1] --> R+.
  f <- function(x) {g <- hypotheses[[h]]; g(x) / g(1)}
  x <- seq(0, 1, length=201)
  y <- diff(x) / diff(f(x))  # Numeric derivatives (intensities)
  plot(x[-1], y, type="l", ylim=c(0, max(y)), xlim=0:1, lwd=2, col="Blue",
       xlab="x", ylab="Intensity", main="Intensity Function")
  abline(h = 1, lty=3) # The reference line is the homogeneous intensity
  
  # Plot an instance of this process.
  PP.plot(rPP(n, f), col=gray(0, alpha=.25), main=bquote(paste("Instance of ", H[.(h)])))
  
  # Estimate the sampling distribution and plot it.
  H <- replicate(n.sim, PPGOF(rPP(n, f))$p.value)
  plot(ecdf(H), xlim=0:1, xlab="p value", lwd=2,
       main=bquote(paste("Sampling Distribution under ", H[.(h)])))
  abline(0:1, col=hsv(0, alpha=0.5), lwd=2) # The reference line is a uniform distribution
}
par(mfrow=c(1,1), mai=mai) # Restore the original plotting environment