Data Modeling – What Type of Data Model Should Be Used?

binomial distribution

Could someone kindly help me understand how to model my data correctly?

Note: I've significantly rewritten my question. I hope that's ok.

I am testing the performance of different proteins in bacteria. The question that I want to answer is "Which proteins (A, B, C, …) perform better when paired with Protein X compared to Protein Y?"

My data are generated as follows. A bacterial strain is constructed to carry one combination of proteins (one of A/B/C and either X or Y). Human cells are infected with those bacteria and their resulting phenotype ("round" or "flat") is recorded. The experiment is repeated in triplicate.

I'm having trouble setting up a data table and choosing the correct type of statistical test to analyze these data. In GraphPad Prism, I've tried setting it up as grouped columns (with biological replicates as sub-columns) and filling in the percentages of cells with one phenotype. However, as EdM commented below, the raw data are binomial, so it seems that a "fractions of whole" table would be more appropriate. In that case, I'm not sure how to handle the biological replicates.

Thank you in advance for any help.

Best Answer

With percentage/proportion data from an experiment having combinations of conditions like this, you want to do something logically similar to ANOVA but appropriate for success/failure counts (of cell phenotype in this case). The variance of a proportion depends on the number of cases and the probability of a positive result. Thus you need to perform some form of binomial multiple regression that, unlike standard ANOVA, takes that specific variance structure into account.
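To make the dependence of the variance on the number of cells and on the underlying probability concrete, here is a minimal sketch in R. The helper name prop_se and all the input values are assumptions for illustration, not numbers from the experiment.

```r
# Standard error of an observed proportion under a binomial model:
# SE = sqrt(p * (1 - p) / n), where p is the success probability
# and n is the number of cells counted.
# prop_se is a hypothetical helper; the inputs are made-up values.
prop_se <- function(p, n) sqrt(p * (1 - p) / n)

se_half_100   <- prop_se(0.5, 100)  # p near 0.5: largest SE for a given n
se_ninety_100 <- prop_se(0.9, 100)  # p near 0 or 1: smaller SE
se_half_400   <- prop_se(0.5, 400)  # four times the cells: half the SE

print(c(se_half_100, se_ninety_100, se_half_400))
```

Because the spread differs from group to group in this way, the equal-variance assumption behind standard ANOVA does not hold for raw percentages.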

Multiple logistic regression, a common choice for such analysis, has been available for GraphPad Prism since version 8.3.0. I'm not sure exactly how multiple logistic regression is implemented in Prism, but here are two things to look out for.

First, when specifying predictors in the regression model you must include interaction terms representing the various combinations of Proteins A/B/C with Proteins X/Y. That's what gives the analysis the logical structure of a 2-way ANOVA table.
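As a rough illustration of what "including interaction terms" means, the sketch below builds the design matrix for a hypothetical 3 x 2 layout in R. The column names protein and partner are assumptions made for this example; Prism will have its own way of specifying the same thing.

```r
# Hypothetical design: three test proteins crossed with two partner proteins.
design <- expand.grid(
  protein = factor(c("A", "B", "C")),
  partner = factor(c("X", "Y"))
)

# `protein * partner` expands to protein + partner + protein:partner,
# i.e. both main effects plus their interaction.
mm <- model.matrix(~ protein * partner, data = design)

# Columns containing ":" are the interaction terms.
print(colnames(mm))
```

With A and X as the reference levels, the interaction columns describe how much the Y-versus-X difference for proteins B and C departs from the Y-versus-X difference for protein A.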

Second, instead of just specifying the percentages for each group, you need to let the software know how many cells were examined. Even if you examined exactly the same total number of cells in each case, the software needs to know the total number in each case to provide corresponding error estimates consistent with its assumption of binomial distributions.

The ways to do that differ among software systems. My quick reading of the GraphPad help page suggests that you might have to reformat your data into a "long form," with one row per individual cell indicating its treatments along with its phenotype (outcome) indicated as 1/0.*
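If you already have counts per strain, one way to produce such a "long form" table is sketched below in R. The counts and the column names (round, flat, is_round) are invented for illustration; the resulting table could then be exported for use in other software.

```r
# Hypothetical counts for two strains; the numbers are invented.
counts <- data.frame(
  protein = c("A", "A"),
  partner = c("X", "Y"),
  round   = c(3, 1),   # cells scored "round"
  flat    = c(2, 4)    # cells scored "flat"
)

# Repeat each strain's row once per cell, then code the phenotype as 1/0.
n_cells <- counts$round + counts$flat
long <- counts[rep(seq_len(nrow(counts)), times = n_cells),
               c("protein", "partner")]
long$is_round <- unlist(Map(function(r, f) rep(c(1, 0), times = c(r, f)),
                            counts$round, counts$flat))
rownames(long) <- NULL

print(long)
```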

I don't see that the uninfected control cells add much to the analysis except for internal quality control. The highest percentage for uninfected cells is numerically lower than the lowest percentage for any treatment combination. Nevertheless, the more flexible structure of a multiple regression versus a standard rectangular ANOVA could make including them possible if you wish.

Once the multiple logistic regression is done, the comparisons among specific treatment combinations are analogous to what you would do with ANOVA. They would, however, be based on estimates of model coefficients and associated errors that are more appropriate for this type of outcome data.
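A comparison of that kind can be sketched in R as a linear combination of model coefficients. Everything here is an assumption for illustration: the counts are invented, the column names are hypothetical, and the comparison shown (Y versus X within protein B) is just one example.

```r
# Hypothetical aggregated counts for a 3 x 2 design (numbers are invented).
d <- data.frame(
  protein = factor(rep(c("A", "B", "C"), times = 2)),
  partner = factor(rep(c("X", "Y"), each = 3)),
  round   = c(40, 55, 30, 60, 50, 70),
  flat    = c(60, 45, 70, 40, 50, 30)
)
fit <- glm(cbind(round, flat) ~ protein * partner,
           family = binomial, data = d)

# Log odds ratio of "round" for Y versus X, within protein B:
# the partnerY main effect plus the proteinB:partnerY interaction.
b <- coef(fit)
w <- setNames(numeric(length(b)), names(b))
w[c("partnerY", "proteinB:partnerY")] <- 1

log_or <- sum(w * b)
se     <- sqrt(drop(t(w) %*% vcov(fit) %*% w))
ci     <- exp(log_or + c(-1, 1) * qnorm(0.975) * se)  # Wald 95% interval

print(c(odds_ratio = exp(log_or), lower = ci[1], upper = ci[2]))
```

If you make several such comparisons, a multiple-comparison correction is needed just as it would be after ANOVA.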

*The glm() function in the free R software that I typically use for logistic regression would allow you to specify the numbers of "successes" and "failures" for each of your 18 treatment combinations. Perhaps Prism provides a similar way to enter your data more compactly, but I didn't see that in a quick review of the manual.
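For reference, the compact "successes and failures" form of data entry for glm() looks roughly like the sketch below. The data are invented (3 proteins x 2 partners x 3 replicates, with an assumed 100 cells scored per replicate) and the column names are hypothetical.

```r
# Hypothetical data: 3 proteins x 2 partners x 3 biological replicates.
# expand.grid varies the first column fastest, so rows run through the
# replicates of A/X, then B/X, C/X, A/Y, B/Y, C/Y.
d <- expand.grid(
  replicate = 1:3,
  protein   = factor(c("A", "B", "C")),
  partner   = factor(c("X", "Y"))
)
d$round <- c(41, 38, 44,  52, 57, 55,  29, 33, 31,   # partner X: A, B, C
             61, 58, 63,  49, 51, 53,  72, 68, 70)   # partner Y: A, B, C
d$flat  <- 100 - d$round   # assumes 100 cells scored per replicate

# cbind(successes, failures) tells glm() how many cells were examined.
fit_full <- glm(cbind(round, flat) ~ protein * partner,
                family = binomial, data = d)
fit_main <- glm(cbind(round, flat) ~ protein + partner,
                family = binomial, data = d)

# Likelihood-ratio test of the interaction terms.
lrt <- anova(fit_main, fit_full, test = "Chisq")
print(lrt)
```

This sketch treats every cell as independent. If the replicates vary more than a binomial model predicts, family = quasibinomial or a mixed model with a replicate term are possible refinements.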

Related Solutions

Solved – How should I use prop.test function

If what you mean to test is whether more people reported an increase than the combined number who reported a decrease or no difference (which is what I think you mean) then your first version is closer to the correct one. Your null hypothesis in that case is that people choose 50-50 between "increase" and "no increase, or decrease" and you are open to evidence either way (greater or less than 50% choose increase).

However, you are actually interested in testing the alternative hypothesis that more than 50% chose it, so you need a one-sided test. You can request this explicitly in prop.test by stating that your alternative hypothesis is only for p being greater than 0.5:

prop.test(30, 36, p = 0.5, alternative = "greater")
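To see what the one-sided option changes, the sketch below runs the test both ways on the same counts (30 of 36) and extracts the p-values. The object names are arbitrary.

```r
# 30 of 36 respondents reported an increase (counts from the call above).
two_sided <- prop.test(30, 36, p = 0.5)
one_sided <- prop.test(30, 36, p = 0.5, alternative = "greater")

print(c(two_sided = two_sided$p.value, one_sided = one_sided$p.value))
print(one_sided$estimate)  # observed proportion, 30/36
```

Because the observed proportion lies above 0.5, the one-sided p-value is half the two-sided one.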

It's worth pointing out, though, that there is nothing special about the 0.5 proportion here. Why have you chosen it as the cut-off point for your alternative hypothesis? For example, why not take as the null hypothesis that a third of people choose each option, or any other set of probabilities, perhaps based on an experiment in which people fill in the survey after receiving a placebo? Having said that, the 0.5 cut-off has strong intuitive appeal, and there is no doubt that your experiment shows statistically significant evidence that more than 50% report an increase. The only question is how this compares to the percentage who report an increase under other circumstances (which is not what you've asked here, so I won't worry about it).
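Changing the null proportion is a one-argument change. As a sketch, this tests the same counts against the "a third choose each option" null mentioned above; the object name is arbitrary.

```r
# Same counts, but the null hypothesis is that one third report an increase.
res_third <- prop.test(30, 36, p = 1/3, alternative = "greater")

print(res_third$null.value)  # the null proportion being tested
print(res_third$p.value)
```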

Solved – What transformation should I use for a bimodal distribution

Your variable binomial is not binomial. Did you mean bimodal?

Try this:

transformed <- abs(binomial - mean(binomial))
shapiro.test(transformed)
hist(transformed)

which produces something close to a slightly censored normal distribution and (depending on your seed) output such as:

        Shapiro-Wilk normality test

data:  transformed
W = 0.98961, p-value = 0.1564

In general, arbitrary transformations are difficult to justify. You need a reason for doing this sort of thing, independent of the actual data.
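To try the folding transformation without the original data, here is a self-contained sketch on simulated values. The mixture of two normal distributions, the seed, and the variable name x are assumptions for illustration, so the test result will not match the output shown above.

```r
set.seed(1)  # arbitrary seed; results vary with the seed

# Simulated, roughly symmetric bimodal sample (an assumption for illustration).
x <- c(rnorm(100, mean = -3), rnorm(100, mean = 3))

# Fold the data around its mean so the two modes land on top of each other.
transformed <- abs(x - mean(x))

print(shapiro.test(transformed))
hist(transformed)
```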
