Solved – Use of a negative binomial model for fitting alternative splicing event

generalized linear modelgeneticsnegative-binomial-distributionp-value

I am working on RNA-Seq data (on alternative splicing). Let's say I am looking at a particular type of alternative splicing event – exon skipping. For each intron (or junction), I look if it is normally spliced or if there is an exon skipping event happening. I have three biological replicates ( for each junction, I have 3 set of values). To sum it up, my data set looks like this:

Junction 1:
          Rep1    Rep2    Rep3
exonSkip    8      0       0 
normal      12     6       8    

Junction 2:
          Rep1    Rep2    Rep3
exonSkip    5      9       8 
normal      58     60      44    
....
.... 

My objective is, for each junction, from these replicates, to find out if that particular exon-skipping event is statistically significant. Initially, I summed up all the values for exonSkip and normal separately (in the first case, 8 and 26) and then concluded there are at least 2 exonSkip events. However, I came to know its not the best and that there are better ways.

1) From the literature (on gene expression), I came to know that the biological replicates should be used to obtain an estimate. Since these are reads = count data, and they happen to have a high degree of dispersion among replicates, a negative binomial distribution is suggested.

So, I used a glm.nb model from the MASS library as follows:

# R-code 
# Junction 1   
require(MASS)  
dat1 <- data.frame(y=as.numeric(c(8,0,0,12,6,8)), 
                   exonSkip=as.factor(c("yes","yes","yes","no","no","no")))  
out1 <- glm.nb( y ~ exonSkip, data=dat1)  
summary(out1)
# Junction 2   
dat2 <- data.frame(y=as.numeric(c(5,9,8,58,60,44)), 
                   exonSkip=as.factor(c("yes","yes","yes","no","no","no")))  
out2 <- glm.nb( y ~ exonSkip, data=dat2)  
summary(out2)

For Junction 1: I got p=0.143 for exonSkipyes.

Question: Does this mean that the nb model fit can not be trusted?

For Junction 2: I got p<2e-16. However, the test glm.nb gave a

*warning: In theta.ml ... Iteration limit reached*. 

Question: Is this okay?

Now, from the model, I then used the predict function to estimate the values for exonSkip and normal from these 3 replicates with the model fit.

exp(predict(out1, dat1))  
# Result: 2.6667 and 8.6667 = 3 and 9  
exp(predict(out2, dat2))  
# Result: 7.333 and 54.000 = 7 and 54  

Questions:

  1. Is this method of estimating, assuming negative binomial glm model right? Particularly, the use of predict function.
  2. Can I still use predict (as I have used above) in case of a non-significant fit? If not, then how else can I get an estimate?

2) Even if manage to get the estimate, I have again 2 numbers: 1 for exonSkip and other for Normal. From here, I would like to obtain a measure (or p-value) of how significant it is. How I can go about this?

I think the p-value from the glm.nb is how significant the model fits the data…
One way I thought of is: If there were totally X+Y (X = total exonSkip and Y = total Normal) events, then from a sample of b events, if I get a exonSkip, then it would follow a hypergeometric distribution and I could obtain the p-value as,

sum(dhyper(a:b, X, Y, b))

Is this correct?

I'd appreciate any feedback.

Best Answer

Some comments first:

  1. You need a testable hypothesis before getting p-values. That means that you need to describe what the data would look like without and with the effect of interest. You can do parameter estimation without hypothesis testing, but then you need to specify what parameter is of interest, and there are no p-values.
  2. While your values are counts, they represent the number of reads falling into each category with a fixed total per experimental run. Or more precisely, the total is random, but you want to condition on it (you keep saying things like "out of a sample of b events"). If so, your counts do not have a negative binomial distribution, but rather a binomial distribution.

The following assumes such a conditioning on the total number of reads per experimental run. The right analysis approach still depends on whether your goal is to estimate hte proportion of exon-skipping reads for each junction (with a confidence interval) or do you want to be able to estimate the entire distribution of the number of exon-skipping reads given the total number of reads. You seem to be asking for the latter, but I am not sure whether that's what you really need, as it is not clear what are you planning to do with the results of the analysis.

You have very little information to inform the model, so you have to make major assumptions. Based on a preliminary analysis using quasi-binomial regression it appears that there might be overdispersion: the probability of an exon-skipping read varies somewhat between the replicates. If that's true, binomial regression cannot be used, but you could consider beta-binomial regression. Here I will show the binomial regression.

First, you have to set up the data so that counts from the same replicate are in the same row.

d1 <- data.frame(Rep=1:3, Skip=c(8,0,0), Normal=c(12,6,8))

And then we can use a binomial glm:

m1 <- glm(cbind(Skip,Normal) ~ 1, data=d1, family=binomial)
summary(m1)



Call:
glm(formula = cbind(Skip, Normal) ~ 1, family = binomial, data = d1)

Deviance Residuals: 
     1       2       3  
 1.634  -1.794  -2.072  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -1.1787     0.4043  -2.915  0.00355 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10.18  on 2  degrees of freedom
Residual deviance: 10.18  on 2  degrees of freedom
AIC: 15.613

Number of Fisher Scoring iterations: 4

Note that the p-value shown is probably meaningless to you. It tests whether the parameter is 0, and that corresponds to testing whether the probability of exon-skipping is 0.5.

The predict function can be used to get the predicted probability of exon-skipping for each replicate, and the dbinom function can can be used to get the individual response probabilities:

p1 <- predict(m1, type="response")
> p1
        1         2         3 
0.2352941 0.2352941 0.2352941 
> newtotal <- 10
> dbinom(0:newtotal, size=newtotal, p=p1[1])
 [1] 6.838240e-02 2.104074e-01 2.913333e-01 2.390427e-01 1.287153e-01 4.752565e-02
 [7] 1.218606e-02 2.142605e-03 2.472236e-04 1.690418e-05 5.201286e-07

So, for example, the probability of having 0 exon-skipping reads out of 10 reads at junciton 1 is estimated to be 0.06828.

Related Question