Solved – Statsmodels’ Negative Binomial: after .fit_regularized(), how to turn PMF into PPF to get the discrete values

density function, negative-binomial-distribution, scipy, statsmodels

I used the statsmodels package to fit a Negative Binomial model to my data, which contain ~1500 samples with 21 covariates. Because my dependent variable (y) is skewed and the data are overdispersed, I used the fit_regularized function (with the normal .fit(), none of the numerical solvers I tried (newton, nm, cg…) converge).

When I plot the fitted data, it looks more like a probability mass function (PMF), because the predicted values are not integers (as is usually the case with NB) but floating-point numbers. I thought I could use scipy.stats.nbinom.ppf() to turn these floating-point numbers into integer values (as I previously did with Poisson and its mean mu), but scipy.stats.nbinom.ppf() does not take a mu; it takes the parameters q, n, and p, and I can't make sense of them.

If I use q=0.95, p=0.5, and n=predicted values, I get the following plot:

And if I use q=0.5, p=0.35, and n=predicted values, I get the following plot:

It seems that p and q control the shape of the ppf-converted distribution, but I do not understand the rationale behind it. What is p? What is q?

My question is: what is the proper way of turning the "orange" distribution into a discrete one, so that it resembles the original "grey" distribution as much as possible? It does not feel right to just try values until I force the "orange" to look like the "grey", but I don't know a proper way of turning the PMF (if it's a PMF) into the PPF. Does anybody know how to do this, or whether I am on the right track? I only want the discrete values, as with Poisson.

Thanks!

Best Answer

NegativeBinomial regression is usually done in a different parameterization from the standard negative binomial distribution.

The predicted value is the mean or expected value of the distribution, usually modeled with a log link. The second parameter is for the dispersion. As with Poisson, the mean of the negative binomial is a continuous variable, so the predicted values are not integers.

The parameters have to be converted from the regression parameterization to the usual distribution parameterization, which can then be used, for example, with the scipy.stats distributions.
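As a minimal sketch of that conversion, assuming the NB2 form in which Var(y) = mu + alpha * mu^2 (to my knowledge the default loglike_method of statsmodels' NegativeBinomial, with alpha reported as the last fitted parameter), the usual (n, p) parameters are n = 1/alpha and p = n/(n + mu). The helper names nb2_to_scipy_params and nbinom_ppf and the example numbers are hypothetical. The quantile function is written out in plain Python only to keep the example self-contained; its first argument q is the quantile level (for example 0.5 for the median), not a shape parameter.

```python
def nb2_to_scipy_params(mu, alpha):
    """Convert NB2 regression parameters (mean mu, dispersion alpha) to (n, p).

    Assumption: Var(y) = mu + alpha * mu**2.
    """
    n = 1.0 / alpha
    p = n / (n + mu)
    return n, p


def nbinom_ppf(q, n, p):
    """Smallest integer k with CDF(k) >= q, where
    pmf(k) = Gamma(k + n) / (k! * Gamma(n)) * p**n * (1 - p)**k."""
    if not 0.0 <= q < 1.0:
        raise ValueError("q must be in [0, 1)")
    pmf = p ** n          # pmf(0)
    cdf = pmf
    k = 0
    while cdf < q:
        # recursion: pmf(k + 1) = pmf(k) * (k + n) / (k + 1) * (1 - p)
        pmf *= (k + n) / (k + 1) * (1.0 - p)
        k += 1
        cdf += pmf
    return k


# Hypothetical predicted means and dispersion estimate, for illustration only.
mu_hat = [0.8, 5.0, 12.3]
alpha_hat = 1.0

params = [nb2_to_scipy_params(m, alpha_hat) for m in mu_hat]
medians = [nbinom_ppf(0.5, n, p) for n, p in params]
print(params)
print(medians)
```

With scipy available, scipy.stats.nbinom.ppf(q, n, p) should give the same quantiles. Note that a quantile of each fitted distribution is still a summary of it; integer values that resemble the observed data would come from random draws from the fitted distributions rather than from a single quantile.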

Specific to statsmodels:

The NegativeBinomial model has no conversion function built in. A script and a link to a Stack Overflow question are here: https://github.com/statsmodels/statsmodels/issues/106#issuecomment-43961704.

Since version 0.9.0, statsmodels has a generalized version, NegativeBinomialP, which has an extended predict method and a parameter conversion function: see the which option of predict at http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.NegativeBinomialP.predict.html and the convert_params method, currently used only internally, at http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.NegativeBinomialP.convert_params.html

(Full support for these conversions is still work in progress in statsmodels.)

A notebook to illustrate some of the new countmodel features is here https://gist.github.com/josef-pkt/c932904296270d75366a24ee92a4eb2f

Related Solutions

Solved – Model validation after fitting a negative binomial GLM in R

You might find distplot() from the vcd package useful for the original data (edit: you can't use it on residuals). It draws Friendly's "negativebinomialness plots" and shows how well the negative binomial model fits:
distplot(response, type = "nbinomial", ...)

To obtain the parameters: glm.nb uses the "Gamma mixture of Poisson" representation. It is actually a log-linear model that is fitted, so you should get the mean as $\exp(X\beta)$.

For example, let's say your data come from a negbin with mean 5 and theta of 1 (in the alternative representation as described above). Then you can get the mean estimate simply by

library(MASS)  # provides glm.nb
set.seed(10)
df <- data.frame(y=rnbinom(100,size=1,mu=5))
m0 <- glm.nb(y~1,data=df)
m0
exp(coef(m0))  # estimated mean
m0$theta       # estimated dispersion parameter

which are in this case 5.1 for the mean (pretty close) and 1.6 for the dispersion parameter (pretty far off).
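The "Gamma mixture of Poisson" representation mentioned above can be illustrated with a small simulation. This is only a sketch: the helper names poisson_draw and negbin_draw are hypothetical, and the sample size is chosen arbitrarily. Each count is a Poisson draw whose rate is itself Gamma distributed with shape theta and scale mu/theta, so the counts have mean mu and variance mu + mu^2/theta.

```python
import math
import random


def poisson_draw(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for moderate lam)."""
    limit = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def negbin_draw(mu, theta, rng):
    """Negative binomial draw as a Gamma mixture of Poissons."""
    # Gamma rate with shape theta and scale mu / theta, so its mean is mu.
    lam = rng.gammavariate(theta, mu / theta)
    return poisson_draw(lam, rng)


rng = random.Random(10)
mu, theta = 5.0, 1.0
sample = [negbin_draw(mu, theta, rng) for _ in range(20000)]

mean = sum(sample) / len(sample)
var = sum((y - mean) ** 2 for y in sample) / (len(sample) - 1)
print("sample mean:", mean)
print("sample variance:", var, "theoretical:", mu + mu ** 2 / theta)
```

With only 100 observations, as in the R example, the estimate of theta is far noisier than the estimate of the mean, which is in line with the dispersion estimate being "pretty far off".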

If you fit a model for the conditional mean, you interpret it as in any other log-linear model; see this discussion on stack exchange.

EDIT: If you want to know how to get the mean in a negbin regression model, you need to exponentiate the linear predictor $X\beta$ for each observation and average over the observations.

For example: I take the quine data and fit

m1 <- glm.nb(Days~Sex,data=quine)

Now males are coded 1 and females 0. To get the mean for males you write

> exp(coef(m1)[1]+coef(m1)[2]*1)  
[1] 17.95455

and for females

> exp(coef(m1)[1]+coef(m1)[2]*0)     
[1] 15.225

Now, to get the overall mean, you must weight these by the number of females and males, which is

> table(quine$Sex)  
 F  M   
80 66

and hence the mean is

> (80/(66+80))*15.225+(66/(80+66))*17.95455
[1] 16.45891

This is confirmed by

> nb0 <- glm.nb(Days ~ 1, data = quine)    
> exp(coef(nb0))  
(Intercept)  
[1] 16.4589

(apart from rounding errors).
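The weighting step can be rechecked with a few lines of Python, using only the group means and counts quoted above (the variable names are just for illustration). Weighting each group mean by its count is the same as averaging $\exp(X\beta)$ over all 146 observations.

```python
# Group means exp(X beta) and group sizes, copied from the output above.
group_mean = {"F": 15.225, "M": 17.95455}
group_n = {"F": 80, "M": 66}

total = sum(group_n.values())
overall_mean = sum(group_n[g] * group_mean[g] for g in group_n) / total
print(round(overall_mean, 4))
```

The result agrees with the intercept-only estimate exp(coef(nb0)) up to the rounding of the quoted group means.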

Solved – How to fit a mixture of Gamma distributions to the PMF of a discrete distribution

If I understand you correctly, you have a vector of numbers $[0, 1, 2,\ldots,M - 1, M]$ with the probability of seeing each of those values, all of which sum to 1. You want to find a mixture of $N$ gamma distributions to represent the discrete probability mass function. The simplest thing to do may be to minimize the distance (e.g. squared error) between the empirical discrete PMF and the discretized mixture at each point. You can "estimate" the PMF of the continuous mixture at $n$ as the difference between the mixture CDF at $n + 0.5$ and at $n - 0.5$, with the data point at 0 being estimated as the CDF value at $0.5$.

Here is an example R function for a mixture of two gammas. It assumes that the parameters are passed to it as a list of 5 values ($p, \alpha_1, \theta_1, \alpha_2, \theta_2$) and that Data is a data frame or matrix of size $M \times 2$ holding the values and their PMF. It converges rather slowly, but as you're estimating mixtures anyway, it may get you close to what you want.

Dist <- function(pars, Data){
  # pars: list(p, alpha1, theta1, alpha2, theta2)
  # Data: column 1 = values 0..M, column 2 = empirical PMF
  p <- pars[[1]]
  A1 <- pars[[2]]
  T1 <- pars[[3]]
  A2 <- pars[[4]]
  T2 <- pars[[5]]
  X0 <- pmax(Data[, 1] - 0.5, 0)  # lower bin edge, floored at 0
  X1 <- Data[, 1] + 0.5           # upper bin edge
  PMF <- Data[, 2]
  # Mixture probability mass in each bin: CDF(X1) - CDF(X0)
  PMF_C <- p * (pgamma(X1, shape = A1, scale = T1) - pgamma(X0, shape = A1, scale = T1)) +
            (1 - p) * (pgamma(X1, shape = A2, scale = T2) - pgamma(X0, shape = A2, scale = T2))
  return(sum((PMF - PMF_C)^2))
}

Pass that into an optimizer like nloptr and let it rip.
