Using chi square when expected value is 0

chi-squared-test, r

In a class (I'm the teacher), we are crossing Drosophila with different traits to see whether certain characteristics are inherited on autosomes or on sex chromosomes. In order to do that, we do reciprocal crosses.

When we cross a male with normal wings (NN) x a female with vestigial wings (nn), all the descendants should have normal wings (Nn) if the gene is located on an autosome (the offspring would be 50% males and 50% females, and 100% of the males and 100% of the females would have normal wings).
Now if we take two of these descendants (one male and one female) and cross them, we should again get 50% males and 50% females. Of the total, we should get $3/4 \times 1/2 = 3/8 = 0.375$, or 37.5%, males with normal wings and $1/4 \times 1/2 = 1/8 = 0.125$, or 12.5%, males with vestigial wings (the same logic applies to females).
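
As a quick check of these proportions, here is a minimal sketch in R that enumerates the gametes of an Nn x Nn cross. It assumes N (normal) is fully dominant over n; the object names are only illustrative.

```r
# Enumerate the F2 of an Nn x Nn cross, assuming N (normal) is dominant over n
gametes <- c("N", "n")
f2 <- expand.grid(from.father = gametes, from.mother = gametes,
                  sex = c("male", "female"), stringsAsFactors = FALSE)

# An offspring is vestigial only if it inherits n from both parents
f2$phenotype <- ifelse(f2$from.father == "n" & f2$from.mother == "n",
                       "vestigial", "normal")

# The 8 combinations are equally likely, so proportions are counts / 8
expected.f2.auto <- table(f2$sex, f2$phenotype) / nrow(f2)
expected.f2.auto
```

Each sex should come out at 0.375 normal and 0.125 vestigial, matching the 37.5% and 12.5% above.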

An alternative would be that the gene for wing type is located on the X (sex) chromosome. In that case, a male with normal wings ($X^NY$) x a female with vestigial wings ($X^nX^n$) would produce 50% males and 50% females, but 100% of the males would have vestigial wings ($X^nY$) and 100% of the females would have normal wings ($X^NX^n$). Now if we take two of these descendants (one male ($X^nY$) and one female ($X^NX^n$)) and cross them, we should have 50% males and 50% females, and 1/2 of the males would have normal wings and 1/2 would have vestigial wings. The same logic applies to females.

We can apply the same reasoning to eye color, the other trait we want to investigate.

Here is a visual summary of what is explained above (one type of cross out of four):
[image not available]

We then tested this experimentally by crossing the parents and counting the offspring. We bred flies and got the following results:

df <- structure(list(cross = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 
3L, 3L, 3L, 4L, 4L, 4L, 4L), cross.name = c("male normal X females vestigial", 
"male normal X females vestigial", "male normal X females vestigial", 
"male normal X females vestigial", "male vestigial X females normal", 
"male vestigial X females normal", "male vestigial X females normal", 
"male vestigial X females normal", "male red X female white", 
"male red X female white", "male red X female white", "male red X female white", 
"male white X female red", "male white X female red", "male white X female red", 
"male white X female red"), sex = c("male", "male", "female", 
"female", "male", "male", "female", "female", "male", "male", 
"female", "female", "male", "male", "female", "female"), trait = c("wing", 
"wing", "wing", "wing", "wing", "wing", "wing", "wing", "eye", 
"eye", "eye", "eye", "eye", "eye", "eye", "eye"), phenotype = c("normal", 
"vestigial", "normal", "vestigial", "normal", "vestigial", "normal", 
"vestigial", "red", "white", "red", "white", "red", "white", 
"red", "white"), nb.f1 = c(98L, 1L, 70L, 0L, 28L, 0L, 22L, 0L, 
2L, 92L, 109L, 4L, 53L, 0L, 71L, 0L), nb.f2 = c(120L, 43L, 134L, 
50L, 37L, 22L, 47L, 14L, 93L, 82L, 90L, 84L, 72L, 73L, 167L, 
0L), theoretial.f1.autosome = c(50L, 0L, 50L, 0L, 50L, 0L, 50L, 
0L, 50L, 0L, 50L, 0L, 50L, 0L, 50L, 0L), theoretial.f1.sex.chromosome = c(0L, 
50L, 50L, 0L, 50L, 0L, 50L, 0L, 0L, 50L, 50L, 0L, 50L, 0L, 50L, 
0L), theoretial.f2.autosome = c(37.5, 12.5, 37.5, 12.5, 37.5, 
12.5, 37.5, 12.5, 37.5, 12.5, 37.5, 12.5, 37.5, 12.5, 37.5, 12.5
), theoretial.f2.sex.chromosome = c(25L, 25L, 25L, 25L, 25L, 
25L, 50L, 0L, 25L, 25L, 25L, 25L, 25L, 25L, 50L, 0L)), class = "data.frame", row.names = c(NA, 
-16L))

I've been thinking of using a chi-square test to check which hypothesis fits the data, but whenever a theoretical percentage (and hence the expected value) equals 0, the chi square doesn't return a value.
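
To illustrate the breakdown, here is a minimal sketch with illustrative counts (`cell.term` is just a helper name I made up). Each term of the chi-square sum divides by the expected count, so a cell with an expected count of 0 gives NaN when the observed count is also 0, and Inf otherwise.

```r
# Chi-square contribution of a single cell: (observed - expected)^2 / expected
cell.term <- function(observed, expected) (observed - expected)^2 / expected

cell.term(0, 0)    # 0 / 0   -> NaN
cell.term(14, 0)   # 196 / 0 -> Inf
cell.term(37, 30)  # ordinary cell -> finite value
```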

What could be used to test which hypothesis is true, that is, whether each trait (wing or eye) is on an autosome or on a sex chromosome?

Based on one answer to this question, here is the problem that I face when calculating the chi square:

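library(dplyr)  # provides %>%, group_by(), mutate(), select()

# Total number of flies counted per cross, for F1 and F2
df = df %>% 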
  group_by(cross) %>% 
  mutate(sum.per.cross.f1 = sum(nb.f1),
         sum.per.cross.f2 = sum(nb.f2)) %>% 
  ungroup() 

# Expected counts under each hypothesis, then the per-cell chi-square terms
df.q.chi2 = df %>% 
  mutate(exp.nb.f1.auto = sum.per.cross.f1*theoretial.f1.autosome/100,
         exp.nb.f1.sex  = sum.per.cross.f1*theoretial.f1.sex.chromosome/100,
         exp.nb.f2.auto = sum.per.cross.f2*theoretial.f2.autosome/100,
         exp.nb.f2.sex  = sum.per.cross.f2*theoretial.f2.sex.chromosome/100,
         q.1.auto = (nb.f1-exp.nb.f1.auto)^2/exp.nb.f1.auto,
         q.1.sex  = (nb.f1-exp.nb.f1.sex)^2/exp.nb.f1.sex,
         q.2.auto = (nb.f2-exp.nb.f2.auto)^2/exp.nb.f2.auto,
         q.2.sex  = (nb.f2-exp.nb.f2.sex)^2/exp.nb.f2.sex) %>% 
  group_by(cross) %>% 
  select(q.1.auto,
         q.1.sex,
         q.2.auto,
         q.2.sex)

Here is the output :

# A tibble: 16 × 5
# Groups:   cross [4]
   cross q.1.auto q.1.sex q.2.auto  q.2.sex
   <int>    <dbl>   <dbl>    <dbl>    <dbl>
 1     1    2.16  Inf      0.788    12.7   
 2     1  Inf      82.5    0.00324  22.1   
 3     1    2.49    2.49   0.115    25.7   
 4     1  NaN     NaN      1.01     15.6   
 5     2    0.36    0.36   1.42      1.63  
 6     2  NaN     NaN      3.27      2.13  
 7     2    0.36    0.36   0.0889    2.82  
 8     2  NaN     NaN      0.0667  Inf     
 9     3   99.5   Inf     11.0       0.379 
10     3  Inf       1.28  33.8       0.316 
11     3    0.292   0.292 12.8       0.0867
12     3  Inf     Inf     37.4       0.121 
13     4    1.31    1.31  17.3       0.462 
14     4  NaN     NaN     29.6       0.321 
15     4    1.31    1.31  21.4       0.776 
16     4  NaN     NaN     39       NaN     

You can see that if I sum the 'q' columns to get the 'q' values from which R could compute a p-value, many of the terms are NaN or Inf. That is my question: how do I deal with this (if using the chi square)? Is there another test that would allow me to do this?

If I continue and calculate the p-values, I can make no statistical call on whether one scenario explains the data better:

df.q.chi2

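library(tidyr)  # provides pivot_longer()

# Long format; the NaN and Inf terms are dropped below before summing
df.q.chi2.l = pivot_longer(df.q.chi2,cols = !cross)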
df.q.chi2.l.no.na = na.omit(df.q.chi2.l)
df.q.all = df.q.chi2.l.no.na[is.finite(df.q.chi2.l.no.na$value),]
df.q.all %>% 
  group_by(cross,name) %>% 
  summarise(sum.cross = sum(value, na.rm = TRUE),
            nb = n(),
            pv = 1 - pchisq(sum.cross, nb-1),
            sign = ifelse(pv<=0.05,"sg","ns")) %>% 
  mutate(f = substring(name,3,3)) %>% 
  filter(f ==2)

Below is the table of all the possible outcomes (all are significant, so I would not be able to discriminate whether the genes are found on an autosome or a sex chromosome; but when I look directly at the data, it seems possible to distinguish between the two).

`summarise()` has grouped output by 'cross'. You can override using the `.groups` argument.
# A tibble: 8 × 7
# Groups:   cross [4]
  cross name     sum.cross    nb       pv sign  f    
  <int> <chr>        <dbl> <int>    <dbl> <chr> <chr>
1     1 q.2.auto     1.92      4 5.90e- 1 ns    2    
2     1 q.2.sex     76.1       4 2.22e-16 sg    2    
3     2 q.2.auto     4.84      4 1.84e- 1 ns    2    
4     2 q.2.sex      6.58      3 3.72e- 2 sg    2    
5     3 q.2.auto    94.9       4 0        sg    2    
6     3 q.2.sex      0.903     4 8.25e- 1 ns    2    
7     4 q.2.auto   107.        4 0        sg    2    
8     4 q.2.sex      1.56      3 4.59e- 1 ns    2    

For completeness, here are the cells where the expected number of individuals is 0.

df %>% 
  mutate(exp.nb.f1.auto = sum.per.cross.f1*theoretial.f1.autosome/100,
         exp.nb.f1.sex  = sum.per.cross.f1*theoretial.f1.sex.chromosome/100,
         exp.nb.f2.auto = sum.per.cross.f2*theoretial.f2.autosome/100,
         exp.nb.f2.sex  = sum.per.cross.f2*theoretial.f2.sex.chromosome/100,
         q.1.auto = (nb.f1-exp.nb.f1.auto)^2/exp.nb.f1.auto,
         q.1.sex  = (nb.f1-exp.nb.f1.sex)^2/exp.nb.f1.sex,
         q.2.auto = (nb.f2-exp.nb.f2.auto)^2/exp.nb.f2.auto,
         q.2.sex  = (nb.f2-exp.nb.f2.sex)^2/exp.nb.f2.sex) %>%  
  select(cross.name,cross, sex, phenotype,nb.f1,nb.f2,exp.nb.f1.auto,
         exp.nb.f1.sex ,
         exp.nb.f2.auto,
         exp.nb.f2.sex )
    

See rows 8 and 16, for example. Taking row 16 as an example, the expected count is 0 simply because, when crossing the F1 for eye color (if the gene is on a sex chromosome, with R for red and r for white), there should be no females with white eyes. The reason is that crossing the original parents ($X^rY$ and $X^RX^R$) gives offspring $X^RX^r$ and $X^RY$, and breeding only these offspring together gives $X^RY$ and $X^rY$ for males and $X^RX^R$ and $X^RX^r$ for females, so there can only be females with red eyes, and no females with white eyes.
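
The same reasoning can be sketched in R by enumerating the gametes of the two F1 parents. This is only an illustration: it assumes red (R) is dominant over white (r), and the labels `XR`, `Xr` and `Y` are names chosen for the example.

```r
# F1 parents from the cross X^r Y (white male) x X^R X^R (red female)
father.gametes <- c("XR", "Y")    # F1 male   X^R Y
mother.gametes <- c("XR", "Xr")   # F1 female X^R X^r

f2 <- expand.grid(from.father = father.gametes, from.mother = mother.gametes,
                  stringsAsFactors = FALSE)

# The father's gamete decides the sex of the offspring
f2$sex <- ifelse(f2$from.father == "Y", "male", "female")

# Red is assumed dominant: white eyes only if no X^R is inherited
f2$phenotype <- ifelse(f2$from.father == "XR" | f2$from.mother == "XR",
                       "red", "white")

# Percentage of all F2 offspring in each sex x phenotype cell
expected.f2.sex <- 100 * table(f2$sex, f2$phenotype) / nrow(f2)
expected.f2.sex
```

The female/white cell comes out at 0%, which is where the expected count of 0 in row 16 comes from.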

# A tibble: 16 × 9
   cross sex    phenotype nb.f1 nb.f2 exp.nb.f1.auto exp.nb.f1.sex exp.nb.f2.auto exp.nb.f2.sex
   <int> <chr>  <chr>     <int> <int>          <dbl>         <dbl>          <dbl>         <dbl>
 1     1 male   normal       98   120           84.5           0            130.           86.8
 2     1 male   vestigial     1    43            0            84.5           43.4          86.8
 3     1 female normal       70   134           84.5          84.5          130.           86.8
 4     1 female vestigial     0    50            0             0             43.4          86.8
 5     2 male   normal       28    37           25            25             45            30  
 6     2 male   vestigial     0    22            0             0             15            30  
 7     2 female normal       22    47           25            25             45            60  
 8     2 female vestigial     0    14            0             0             15             0  
 9     3 male   red           2    93          104.            0            131.           87.2
10     3 male   white        92    82            0           104.            43.6          87.2
11     3 female red         109    90          104.          104.           131.           87.2
12     3 female white         4    84            0             0             43.6          87.2
13     4 male   red          53    72           62            62            117            78  
14     4 male   white         0    73            0             0             39            78  
15     4 female red          71   167           62            62            117           156  
16     4 female white         0     0            0             0             39             0  

Best Answer

Consider a die that has equal probabilities for its six faces. However, the faces are labeled 1, 1, 2, 3, 4, 5. So you have five possible outcomes with respective probabilities $p = (1/3, 1/6, 1/6, 1/6, 1/6).$ Your table will have 'categories' 1, 2, 3, 4, 5. You will ignore the category 6 that would have been possible with a standard die.

Example in R:

set.seed(2021)
x = sample(1:5, 600, replace = TRUE, prob = c(2,1,1,1,1)/6) # 600 simulated rolls
t = tabulate(x);  t
[1] 205 101 107 102  85                         # observed face counts
e = c(200, 100, 100, 100, 100);  e
[1] 200 100 100 100 100                         # expected counts; 600p

The chi-squared test has P-value $0.57 > 0.05 = 5\%,$ so the null hypothesis that categories have the probabilities $p$ is not rejected.

q = sum((t-e)^2/e);  q                  
[1] 2.915                         # chi-sq statistic
pv = 1 - pchisq(q, 4); pv
[1] 0.5721492                     # P-value

Similarly, in your study, just suppress the impossible categories. Degrees of freedom for the chi-squared statistic will be the number of remaining categories, minus one.
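
As a sketch of how this might look for one case in your data, take cross 4 in the F2 under the sex-chromosome hypothesis, with counts and percentages copied from your table (the variable names are only illustrative). The 'female white' category has expected proportion 0, so it is dropped and the test runs on the three remaining categories with 2 degrees of freedom. This relies on the assumption that nothing was actually observed in the dropped category. If flies do turn up in a category the hypothesis rules out (as with the 14 vestigial females in cross 2), dropping it would throw away evidence against that hypothesis, so the sketch checks for that first.

```r
# Cross 4, F2: male red, male white, female red, female white
observed <- c(72, 73, 167, 0)
p.sex    <- c(25, 25, 50, 0) / 100   # expected proportions if the gene is X-linked

keep <- p.sex > 0                    # categories the hypothesis allows

# Assumption of this sketch: no flies were observed in impossible categories
stopifnot(all(observed[!keep] == 0))

# Goodness-of-fit test on the remaining categories
res <- chisq.test(observed[keep], p = p.sex[keep])
res$statistic   # chi-squared statistic
res$parameter   # degrees of freedom = remaining categories - 1
res$p.value
```

This should give a statistic of about 1.56 on 2 degrees of freedom and a P-value of about 0.459, the same values your summary table shows for cross 4 under `q.2.sex`.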