I have a problem where I need to compute the likelihood of a given sequence of events (in time) given a Poisson process with a particular lambda (events per second). I only need it to within a constant, for comparison to other possible sequences. A likelihood ratio, for example:
p(sequence_1 | lambda) / p(sequence_2 | lambda)
These probabilities could be computed by computing the value of the Poisson distribution over the full time span. For example, if the whole interval is 10 seconds, I could compute the value of the Poisson distribution with 10*lambda and the total number of events.
However, I could also break the full interval down into subsets, such as 1 second intervals. I think the total probability for a sequence ought to be related to the product individual Poisson probabilities for each sub-interval, but I think I must be missing a normalizing factor. Furthermore, at an intuitive level, if all the events were bunched up in time, analyzing over the full time may appear likely while looking at the sub intervals is unlikely.
Is there a way to do this where the result is not a function of the intervals used for processing?
EDIT: For clarification, my goal is to check a sequence against the Poisson process and see how well it matches. I need the score to be in units of probability such that if the score is twice as high, it is twice as likely to be from the specified Poisson.
Best Answer
Consider a chi-squared test.
Before examining the data, you may partition the time interval (which we can, by establishing a suitable origin and unit of time, assume to be $(0,1]$) into $k\ge 2$ subintervals $\mathcal{I}_j=(t_{j-1}, t_{j}]$ with $0=t_0 \lt t_1 \lt \cdots \lt t_k = 1$ and count the numbers of events in each subinterval.
Under the null hypothesis of a homogeneous Poisson process of intensity $\lambda,$ the expected number of events in $\mathcal I_j$ is $\lambda(t_{j} - t_{j-1}).$ The maximum likelihood estimator of $\lambda$ is $n$ itself. The alternative hypothesis of interest is any departure from this condition of homogeneity. Thus, this situation conforms exactly to the description of the chi-squared test at https://stats.stackexchange.com/a/17148/919. The reference chi-squared distribution for computing the p-value has $n-1$ degrees of freedom because one parameter has been estimated.
Given an array
x
of the observed times in $(0,1]$ of any continuous point process, here is anR
function to carry out the chi-squared test.The variable
k
is the number of equal-width subintervals to create. (I leave coding the more general test, with variable length intervals, to anyone who needs this: the code is fussier but not essentially different.) This solution merely tabulates the counts within each subinterval and passes that summary on to the built-inchisq.test
function. The result is an object that includes the chi-squared test statistic and its p-value, which can be extracted via$p.value
(as shown in the code at the end of this post).How effective is this? Here is an example with $n=100$ events. The top row displays (a) the (constant) intensity function, plotting the relative rate $\lambda(x)$ against the time $x;$ (b) one realization of this process; and (c) the sampling distribution of chi-squared p-values (based on ten thousand independent realizations). Because this is the null hypothesis, ideally that distribution is uniform. Clearly it approximates uniformity closely.
The bottom row is a similar array of displays for a non-constant intensity function: that is, a non-homogeneous Poisson process. To my eye, at least, there is no qualitative difference between the illustrated realization of the process and the preceding realization of a homogeneous process: evidently, the difference is subtle. Nevertheless, as the bottom right plot shows, the sampling distribution of p-values is markedly different: they tend to be higher than before, especially in the low range. In other words, this test has appreciable power to determine that this process is not homogeneous.
Comments
This reference chi-squared distribution will work well for large $k$ and/or large $n.$ For small $k,$ to rely on the chi-squared distribution for computing p-values, you need to respect the standard rule of thumb that all of the expected counts are $5$ or greater. If this is not possible, you can resort (as is usual) to permutation testing. For small test sizes and smallish $n$ you will discover that the discreteness of the sampling distribution of the statistic might make the test more rigorous than desired, so make sure to study this possibility before proceeding in such cases.
If you anticipate certain forms of inhomegeneity might occur, it is best to partition time into intervals with approximately equal expected counts. The flexibility afforded by choosing the numbers and widths of these intervals makes the chi-squared test especially appropriate and powerful in this application. However, one thing you do not want to do is experiment with choices of $k$ and the $t_j:$ that form of "data snooping" would yield unrealistically small p-values. It would be OK, though, to use a preliminary dataset (perhaps a random subset of the data) to choose $k$ and the $t_j$ and then apply those to the rest of the data.
Code example
The following code creates the figures and illustrates how
PPGOF
can be used.