Solved – How to calculate degrees of freedom for chi squared test

Tags: bioinformatics, biostatistics, chi-squared-test, degrees-of-freedom

I am very new to stats, and if the chi squared test is not the best way to handle this, please let me know.

To simplify what I'm trying to do, I have many separate lists of genes. Each list has a different number of entries (one gene being one entry). For each list, I have counted the number of times each specific gene appears (that is my observed). So for example, in one specific list, the gene KRAS occurs three times (that is entered into a table as my observed). In a different list, it occurs five times (another observed value).

For my expected, I've pooled all lists together into one master list and counted the number of times each gene appears in it. I divide that count by the total number of entries in the master list to get the gene's overall probability based on the whole data set, then multiply that probability by the size of any one list to get the expected count for that list.
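A minimal sketch of that expected-count calculation, in case it helps make the description concrete. The gene names and counts are made up for illustration, and it assumes the table contains every gene, so that each column total equals the size of that list.

```python
# Hypothetical observed counts: one row per gene, one column per list.
observed = {
    "GeneA": [3, 5, 2],
    "GeneB": [1, 0, 4],
    "GeneC": [6, 5, 4],
}


def expected_counts(observed):
    """Expected count = (gene's share of the pooled master list) * (list size)."""
    rows = list(observed.values())
    # Size of each list, assuming the table covers every gene in it.
    list_sizes = [sum(column) for column in zip(*rows)]
    # Total number of entries in the pooled master list.
    grand_total = sum(list_sizes)

    expected = {}
    for gene, counts in observed.items():
        pooled_probability = sum(counts) / grand_total
        expected[gene] = [pooled_probability * size for size in list_sizes]
    return expected


expected = expected_counts(observed)
for gene, values in expected.items():
    print(gene, [round(v, 2) for v in values])
```

Computed this way, each expected value equals (row total × column total) / grand total, which is the usual expected count for a contingency table.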

I know how to then calculate the chi squared test statistic, but I don't know how to get the degrees of freedom from this.

I can't even begin to figure out how to put this into a tabular form, but I suppose one way of representing it would be:

      List1   List2   List3   List4   List5 etc. 
GeneA  O/E    O/E     O/E     O/E      O/E 
GeneB  O/E    O/E     O/E     O/E      O/E 
etc.
  • where O/E represents two values, one being observed, the other being expected.

ULTIMATELY, I just want to figure out if individual genes occurring within individual lists deviate from their occurrences within the whole data set (all lists combined). Could someone suggest how I could get the degrees of freedom for chi squared, or if there might be an easier test to use?

Best Answer

What you did, and the question you are asking, look like a standard contingency table analysis. The degrees of freedom in this case are $(r-1)(c-1)$, where $r$ is the number of rows (number of different genes) and $c$ is the number of columns (number of lists).
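As a rough illustration, the statistic and the degrees of freedom could be computed together like this. The 3×3 table is invented for the example, and the function name `chi_squared_with_dof` is just a placeholder, not anything from a library.

```python
def chi_squared_with_dof(observed):
    """Return the Pearson chi-squared statistic and (r - 1)(c - 1) for a table.

    `observed` is a list of rows (genes), each a list of counts per column (list).
    """
    n_rows = len(observed)
    n_cols = len(observed[0])
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Expected count under independence of gene and list.
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - e) ** 2 / e

    dof = (n_rows - 1) * (n_cols - 1)
    return statistic, dof


# Hypothetical table: 3 genes (rows) by 3 lists (columns).
observed = [
    [3, 5, 2],
    [1, 0, 4],
    [6, 5, 4],
]
statistic, dof = chi_squared_with_dof(observed)
print(statistic, dof)
```

With 3 genes and 3 lists this gives $(3-1)(3-1) = 4$ degrees of freedom. If SciPy is available, `scipy.stats.chi2_contingency` computes the statistic, the degrees of freedom, the expected table, and a p-value in a single call.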

A common rule of thumb is that a chi-squared ($\chi^2$) test is reasonable if all the expected values are greater than 5. A more relaxed rule of thumb says that all the expected values must be greater than 1 and that fewer than 20% of them may be less than 5. If any expected value is less than 1, or a large proportion of them are less than 5, a different test would probably be better.
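A small sketch of how those two rules of thumb could be checked against a table of expected values. The function name, the dictionary keys, and the example numbers are all illustrative assumptions; the thresholds are the ones quoted above.

```python
def check_expected_counts(expected):
    """Check a table of expected counts against the two rules of thumb."""
    values = [e for row in expected for e in row]
    share_below_5 = sum(e < 5 for e in values) / len(values)
    all_above_1 = all(e > 1 for e in values)
    return {
        # Strict rule: every expected value is greater than 5.
        "strict_rule_ok": all(e > 5 for e in values),
        # Relaxed rule: every value is greater than 1 and fewer than 20% are below 5.
        "relaxed_rule_ok": all_above_1 and share_below_5 < 0.20,
        "share_below_5": share_below_5,
        "any_below_1": any(e < 1 for e in values),
    }


# Hypothetical expected counts: 3 genes (rows) by 3 lists (columns).
expected = [
    [10 / 3, 10 / 3, 10 / 3],
    [5 / 3, 5 / 3, 5 / 3],
    [5.0, 5.0, 5.0],
]
report = check_expected_counts(expected)
print(report)
```

In this made-up table six of the nine expected values are below 5, so neither rule is satisfied and a different test would be worth considering.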