Data Modeling – What Type of Data Model Should Be Used?

binomial distribution

Could someone kindly help me understand how to model my data correctly?

Note: I've significantly rewritten my question. I hope that's ok.

I am testing the performance of different proteins in bacteria. The question that I want to answer is "Which proteins (A, B, C, …) perform better when paired with Protein X compared to Protein Y?"

My data are generated as follows. A bacterial strain is constructed to carry one combination of proteins (one of A/B/C and either X or Y). Human cells are infected with those bacteria, and the resulting phenotype of each cell ("round" or "flat") is recorded. The experiment is repeated in triplicate.

I'm having trouble setting up a data table and choosing the correct type of statistical test to analyze these data. In GraphPad Prism, I've tried setting it up as grouped columns (with biological replicates as sub-columns) and filling in the percentages of cells with one phenotype. However, as EdM commented below, the raw data are binomial, so it seems that a "fractions of whole" table would be more appropriate. In that case, I'm not sure how to handle the biological replicates.

Thank you in advance for any help.

Best Answer

With percentage/proportion data from an experiment having combinations of conditions like this, you want to do something logically similar to ANOVA but appropriate for success/failure counts (of cell phenotype in this case). The variance of a proportion depends on the number of cases and the probability of a positive result. Thus you need to perform some form of binomial multiple regression that, unlike standard ANOVA, takes that specific variance structure into account.
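To make the variance point concrete, here is a minimal sketch of the binomial standard error of an observed proportion, sqrt(p(1 - p)/n). The function name and the example values are hypothetical and are not taken from the question's data.

```python
import math

def proportion_se(p, n):
    """Standard error of an observed proportion under a binomial model."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical proportions and cell counts, for illustration only.
for p, n in [(0.5, 100), (0.5, 400), (0.9, 100)]:
    print(f"p={p}, n={n}: SE={proportion_se(p, n):.3f}")
```

The error shrinks as more cells are counted and also changes with the proportion itself, which is why a bare percentage is not enough information for the analysis.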

Multiple logistic regression, a common choice for such analysis, has been available for GraphPad Prism since version 8.3.0. I'm not sure exactly how multiple logistic regression is implemented in Prism, but here are two things to look out for.

First, when specifying predictors in the regression model you must include interaction terms representing the various combinations of Proteins A/B/C with Proteins X/Y. That's what gives the analysis the logical structure of a 2-way ANOVA table.
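As an illustration of what the interaction terms look like, the sketch below builds dummy-coded predictor rows by hand. It assumes exactly three proteins (A, B, C) with A as the reference level and X as the reference partner; the column names and the helper `design_row` are hypothetical, and statistical software normally builds these columns for you.

```python
from itertools import product

proteins = ["A", "B", "C"]  # assumed levels; A is the reference
partners = ["X", "Y"]       # X is the reference

def design_row(protein, partner):
    """Intercept, main effects, and protein-by-partner interaction columns."""
    main_protein = [int(protein == p) for p in proteins[1:]]
    main_partner = [int(partner == "Y")]
    # Interaction columns are products of the main-effect indicators.
    interaction = [mp * main_partner[0] for mp in main_protein]
    return [1] + main_protein + main_partner + interaction

columns = ["intercept", "B", "C", "Y", "B:Y", "C:Y"]
for protein, partner in product(proteins, partners):
    print(protein, partner, dict(zip(columns, design_row(protein, partner))))
```

The "B:Y" and "C:Y" columns are what let the X-versus-Y difference vary from one protein to another, which is the question being asked.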

Second, instead of just specifying the percentages for each group, you need to let the software know how many cells were examined. Even if you examined exactly the same total number of cells in each case, the software needs to know the total number in each case to provide corresponding error estimates consistent with its assumption of binomial distributions.

The ways to do that differ among software systems. My quick reading of the GraphPad help page suggests that you might have to reformat your data into a "long form," with one row per individual cell indicating its treatments along with its phenotype (outcome) indicated as 1/0.*
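If a long-form table is needed, the reshaping could be sketched as below. The field names, the choice of "round" as the outcome coded 1, and the counts are all assumptions made for illustration.

```python
def to_long_form(summary_rows):
    """Expand per-group counts into one row per cell with a 1/0 outcome."""
    long_rows = []
    for row in summary_rows:
        base = {k: row[k] for k in ("protein", "partner", "replicate")}
        long_rows += [dict(base, outcome=1) for _ in range(row["round"])]
        long_rows += [dict(base, outcome=0) for _ in range(row["flat"])]
    return long_rows

# Hypothetical counts for two strains in one replicate.
summary = [
    {"protein": "A", "partner": "X", "replicate": 1, "round": 3, "flat": 2},
    {"protein": "A", "partner": "Y", "replicate": 1, "round": 1, "flat": 4},
]
long_rows = to_long_form(summary)
print(len(long_rows), "rows;", sum(r["outcome"] for r in long_rows), "round cells")
```

Keeping the replicate label on every row preserves the information about which biological replicate each cell came from.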

I don't see that the uninfected control cells add much to the analysis except for internal quality control. The highest percentage for uninfected cells is numerically lower than the lowest percentage for any treatment combination. Nevertheless, the more flexible structure of a multiple regression versus a standard rectangular ANOVA could make including them possible if you wish.

Once the multiple logistic regression is done, the comparisons among specific treatment combinations are analogous to what you would do with ANOVA. They would, however, be based on estimates of model coefficients and associated errors that are more appropriate for this type of outcome data.
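For a single protein, one such comparison can be sketched by hand. When a model with full interaction terms is fitted to pooled counts, the X-versus-Y comparison for a given protein reduces to the log odds ratio from that protein's 2x2 table, with the usual Wald standard error. The counts below are hypothetical, and this simplified sketch ignores replicate-to-replicate variation and any multiple-comparison correction.

```python
import math

def log_odds_ratio(round_y, flat_y, round_x, flat_x):
    """Log odds ratio of 'round' for Y versus X, with its Wald standard error."""
    estimate = math.log((round_y / flat_y) / (round_x / flat_x))
    se = math.sqrt(1 / round_y + 1 / flat_y + 1 / round_x + 1 / flat_x)
    return estimate, se

# Hypothetical pooled counts for one protein paired with Y and with X.
est, se = log_odds_ratio(round_y=60, flat_y=40, round_x=30, flat_x=70)
low, high = est - 1.96 * se, est + 1.96 * se
print(f"odds ratio = {math.exp(est):.2f}, "
      f"95% CI {math.exp(low):.2f} to {math.exp(high):.2f}")
```

Comparing proteins with each other then amounts to comparing these log odds ratios, which is what the interaction coefficients estimate.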


*The glm() function in the free R software that I typically use for logistic regression would allow you to specify the numbers of "successes" and "failures" for each of your 18 treatment combinations. Perhaps Prism provides a similar way to enter your data more compactly, but I didn't see that in a quick review of the manual.
