Solved – Switch from Modelling a Process using a Poisson Distribution to a Negative Binomial Distribution

kalman-filter, negative-binomial-distribution, poisson-process, state-space-models

$\newcommand{\P}{\mathbb{P}}$We have a random process that may occur multiple times (or not at all) in a set period of time $T$. We have a data feed from a pre-existing model of this process, which provides the probability of a number of events occurring in the period $0 \leq t < T$. This existing model is old, and we need to run live checks on the feed data for estimation errors. The old model producing the data feed (which provides the probability of $n$ events occurring in the time remaining $t$) is approximately Poisson distributed.

So to check for anomalies/errors, we let $t$ be the time remaining and $X_t$ be the total number of events that will occur in that remaining time. The old model provides the estimates $\P(X_t \leq c)$, so under our assumption $X_t\sim \operatorname{Poisson}(\lambda_{t})$ we have:
$$
\P(X_t \leq c) = e^{-\lambda_t}\sum_{k=0}^c\frac{\lambda_t^k}{k!}\,.
$$
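As a quick numerical sanity check (a sketch only; `lam_t` and `c` below are illustrative values, not feed data), the formula can be verified against scipy's Poisson CDF:

```python
# Sketch: verify the Poisson tail formula above against scipy's CDF.
# lam_t and c are illustrative values, not feed data.
import math
from scipy.stats import poisson

lam_t, c = 4.2, 6

direct = math.exp(-lam_t) * sum(lam_t**k / math.factorial(k) for k in range(c + 1))
assert abs(direct - poisson.cdf(c, lam_t)) < 1e-12
```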
To derive our event rate $\lambda_t$ from the output of the old model (observations $y_{t}$), we use a state-space approach and model the observation relationship as:
$$
y_t = \lambda_t + \varepsilon_t\quad (\varepsilon_t \sim N(0, H_t))\,.
$$
We filter the observations from the old model, using a state-space [constant speed decay] model for the evolution of $\lambda_t$, to obtain the filtered state $E(\lambda_t|Y_t)$, and we flag an anomaly/error in the estimated event frequency from the feed data if $E(\lambda_t|Y_t) < y_t$.
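For concreteness, here is a minimal sketch of the filtering step. It assumes a local-level (random-walk) state as a stand-in for the constant-speed-decay dynamics above; the noise variances `q` and `h` and the data are illustrative assumptions, not the production model:

```python
import numpy as np

def filter_lambda(y, q=0.1, h=1.0, p0=10.0):
    """Filtered means E(lambda_t | Y_t) from a local-level Kalman filter:
    lambda_t evolves as a random walk with variance q (a stand-in for the
    constant-speed-decay dynamics) and y_t = lambda_t + eps_t, eps_t ~ N(0, h)."""
    lam, p = y[0], p0                 # rough initialisation at the first observation
    filtered = []
    for obs in y:
        p = p + q                     # predict: random-walk state variance grows by q
        k = p / (p + h)               # Kalman gain
        lam = lam + k * (obs - lam)   # update with the new observation
        p = (1.0 - k) * p
        filtered.append(lam)
    return np.array(filtered)

# Flag a feed anomaly when the filtered rate falls below the observed value
y = np.array([3.9, 4.1, 4.0, 6.5, 4.2])
lam_hat = filter_lambda(y)
flags = lam_hat < y
```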

This approach works very well at picking up errors in the estimated event counts over the full time period $T$, but not so well if we want to do the same over a shorter period $0 \leq t < \sigma$ where $\sigma < \frac{2}{3} T$. To get around this, we want to switch to the negative binomial distribution, so that we now assume $X_t\sim \operatorname{NB}(r, p)$ and have:
$$
\P(X_{t} \leq c) = p^{r}\sum_{k = 0}^{c} (1 - p)^{k}\binom{k + r - 1}{r - 1},
$$
where the single parameter $\lambda$ is now replaced by the pair $r$ and $p$. This should be straightforward to implement, but I am having some difficulty with the interpretation, and thus have some questions I'd like you to help with (a moment-matching sketch follows the list):

1. Can we merely set $p = \lambda$ in the negative binomial distribution? If not, why not?

2. Assuming we can set $p = f(\lambda)$ where $f$ is some function, how can we correctly set $r$ (do we need to fit $r$ using past data sets)?

3. Is $r$ dependent on the number of events we expect to occur during a given process?
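For reference on questions 1 and 2, one common approach (an assumption on my part, not something prescribed by the feed) is moment matching: under the tail formula above, $E[X_t] = r(1-p)/p$, so matching the filtered Poisson mean $\lambda_t$ gives $p = r/(r + \lambda_t)$. In particular, $p$ is a probability in $(0,1)$, not a rate, so $p = \lambda$ cannot be a valid identification. A sketch:

```python
# Sketch: moment-matching the NB parameters to the filtered Poisson rate.
# Under the tail formula above (pmf ~ binom(k+r-1, r-1) p^r (1-p)^k),
# E[X_t] = r(1-p)/p, so matching E[X_t] = lambda_t gives p = r/(r + lambda_t).
def p_from_lambda(lam_t, r):
    # p is a probability in (0, 1); it cannot simply equal the rate lambda_t
    return r / (r + lam_t)

# r then controls the overdispersion: Var[X_t] = lambda_t * (1 + lambda_t / r),
# which recovers the Poisson variance in the limit r -> infinity.
```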


Addendum on extracting estimates for $r$ (and $p$):

I am aware that if we in fact had this problem reversed, with the observed event counts for each process, we could adopt the maximum likelihood estimator for $r$ and $p$. (Note that the derivation below follows the convention $\P(k; r, p) = \frac{\Gamma(k + r)}{k!\,\Gamma(r)}\,p^{k}(1 - p)^{r}$, so the roles of $p$ and $1 - p$ are swapped relative to the tail formula above.) The maximum likelihood estimator only exists for samples whose sample variance is larger than the sample mean, but when that holds we can write the likelihood function for $N$ independent, identically distributed observations $k_1, k_2, \ldots, k_{N}$ as:
$$
L(r, p) = \prod_{i = 1}^{N}\P(k_i; r, p),
$$
from which we can write the log-likelihood function as:
$$
l(r, p) = \sum_{i = 1}^{N} \ln(\Gamma(k_i + r)) - \sum_{i = 1}^{N} \ln(k_{i}!) - N\ln(\Gamma(r)) + \sum_{i = 1}^{N} k_i \ln(p) + N r\ln(1 - p).
$$
To find the maximum we take the partial derivatives with respect to $r$ and $p$ and set them equal to zero:
\begin{align*}
\partial_{r}\, l(r, p) &= \sum_{i = 1}^{N} \psi(k_i + r) - N\psi(r) + N\ln(1 - p), \\
\partial_{p}\, l(r, p) &= \sum_{i = 1}^{N} \frac{k_i}{p} - \frac{N r}{1 - p}\,.
\end{align*}
Setting $\partial_{r} l(r, p) = \partial_{p} l(r, p) = 0$, the second equation gives $p = \displaystyle\sum_{i = 1}^{N} k_i \Big/ \Big(N r + \sum_{i = 1}^{N} k_i\Big)$; substituting this into the first, we find:
$$
\partial_{r} l(r, p) = \sum_{i = 1}^{N} \psi(k_i + r) - N \psi(r) + N\ln\left(\frac{r}{r + \sum_{i = 1}^{N} k_i / N}\right) = 0.
$$
This equation cannot be solved for $r$ in closed form, but it can be solved numerically, e.g. with Newton's method (or via EM); a numerical sketch follows. In any case, this reversed situation is not ours: although we could use past data to obtain a static $r$ and $p$, that is of little real use because, for our process, we need to adapt these parameters in time, as we did with the Poisson model.
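For completeness, a sketch of that static fit, solving the profiled score equation for $r$ with a bracketing root-finder (the bracket endpoints are assumptions and may need adjusting for a given data set):

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def fit_nb_mle(k, r_lo=1e-6, r_hi=1e6):
    """Static MLE of (r, p) from iid counts k, under the pmf used in the
    derivation above: P(k; r, p) = Gamma(k+r)/(k! Gamma(r)) p^k (1-p)^r.
    Only valid when the sample variance exceeds the sample mean."""
    k = np.asarray(k, dtype=float)
    n, kbar = len(k), k.mean()

    def score_r(r):
        # The profiled score equation in r derived above
        return digamma(k + r).sum() - n * digamma(r) + n * np.log(r / (r + kbar))

    r_hat = brentq(score_r, r_lo, r_hi)       # numerical root; no closed form exists
    p_hat = k.sum() / (n * r_hat + k.sum())   # p from the partial derivative in p
    return r_hat, p_hat
```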

Best Answer

The negative binomial distribution is closely related to the binomial probability model. It is applicable when the following assumptions (conditions) hold:

1. The experiment is repeated under the same conditions until a fixed number of successes, say $r$, is achieved.

2. The result of each trial can be classified into one of two categories, success or failure.

3. The probability $p$ of success is the same for every trial.

4. Each trial is independent of all the others.

The first condition is the only key factor differentiating the binomial from the negative binomial.
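As a sketch of that description (all values illustrative), a Monte Carlo check that "failures before the $r$-th success" under conditions 1 to 4 matches scipy's negative binomial:

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(0)
r, p = 5, 0.4   # fixed number of successes, per-trial success probability

def failures_before_rth_success(r, p):
    # Independent Bernoulli(p) trials under identical conditions (conditions
    # 1-4 above), repeated until the r-th success; count the failures.
    failures = successes = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures

draws = [failures_before_rth_success(r, p) for _ in range(100_000)]
# The empirical mean should be close to the negative binomial mean r(1-p)/p
print(np.mean(draws), nbinom.mean(r, p))
```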
