Solved – Transforming data for chi square — squaring negative value difference scores

chi-squared-test, data-transformation

I need to compare proportions of difference scores between 2 (unequal n) samples (males vs. females) on 2 different measures. I want to enter the difference scores into a contingency table for a chi-square type analysis.

I subtract female scores from male scores for each of the 2 measures:

R1C1 = Male"Yes"MeasureA - Female"Yes"MeasureA 
R1C2 = Male"Yes"MeasureB - Female"Yes"MeasureB 
R2C1 = Male"No"MeasureA - Female"No"MeasureA  
R2C2 = Male"No"MeasureB - Female"No"MeasureB

When I use this method for getting difference scores, not surprisingly, some values are negative, which prevents me from doing chi-square. Is there a way to transform the data to do away with the negative values but preserve the proportions? For example, I was wondering if it would be acceptable to just square all difference scores, and then do chi-square?

So here's a bit more detail. I am investigating a measure of sexual experiences. The original measure asks respondents to indicate whether or not they've experienced a variety of sexual encounters. The survey has 2 parallel versions: one for females (asking about sexual victimization) and one for males (asking about perpetration). Research has shown that, when given the original measure, females report rates of victimization roughly 2/3 higher than the rates of perpetration reported by males. I have created a modified version of the survey (for both male and female versions) and I have hypothesized that this modified version will decrease the discrepancy between female/victim and male/perpetrator rates of responding.

I have an unequal number of males and females. Each participant was given both versions of the survey (original and modified), with the original given first. I have collapsed the response data to be dichotomous: either "yes" ([female] I have been raped / [male] I have raped someone) or "no" ([female] I have never been raped / [male] I have never raped anyone).

So, what I need is a way to determine whether the male-female discrepancy ratio on the original measure is significantly different from the male-female discrepancy ratio on the modified version.

Further info: I have already run paired-sample t-tests and determined that male report rates on the modified version are significantly higher than on the original; female report rates are not significantly different across versions. So I know that the discrepancy is reduced (because male reports increased and female reports did not), but I'm looking for a direct way to compare the difference scores/proportions between measures.

Best Answer

You say you did paired t-tests on the original data, before dichotomizing it, and that males increased significantly from the old form to the new but the female change was not significant. Unfortunately, that cannot be taken as showing that the male change was bigger than the female change. You need to do an independent-groups t-test on the two sets of change scores. (Better yet, you could replace all the t-tests by confidence intervals for the corresponding means and mean differences, which would give you more information.)
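As a rough sketch of that comparison in R: the change scores below are simulated purely for illustration (the sample sizes, means, and the names `change_m` and `change_f` are all made up, not taken from the study). Note that `t.test` defaults to the Welch version of the independent-groups test, which does not assume equal variances.

```r
# Hypothetical change scores (modified minus original), simulated for
# illustration only; replace with the real per-person change scores.
set.seed(1)
change_m <- rnorm(40, mean = 0.8, sd = 1.5)  # 40 males (made-up n)
change_f <- rnorm(65, mean = 0.1, sd = 1.5)  # 65 females (made-up n)

# Independent-groups (Welch) t-test on the two sets of change scores
res <- t.test(change_m, change_f)
res

# Confidence intervals, which carry more information than the tests alone
ci_diff <- res$conf.int              # difference in mean change, males - females
ci_m    <- t.test(change_m)$conf.int # mean change, males
ci_f    <- t.test(change_f)$conf.int # mean change, females
```

If the interval `ci_diff` excludes 0, that is direct evidence that the mean change differs between the two groups, which the two separate paired tests cannot provide.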

For the dichotomized data, the situation is similar.
You have two contingency tables, one for males and one for females, each cross-classifying the responses on the two versions of the survey.

Males      
        Yes   No     Total
  Yes   Myy   Myn    My.
   No   Mny   Mnn    Mn.

Total   M.y   M.n    M.. = M = total number of Males

Females      
        Yes   No     Total
  Yes   Fyy   Fyn    Fy.
   No   Fny   Fnn    Fn.

Total   F.y   F.n    F.. = F = total number of Females

For each table, the analog of the paired t-test is the McNemar test,
http://en.wikipedia.org/wiki/McNemar%27s_test

I know of no simple standard test of the difference between the changes in endorsement rates, but if all of Myn, Mny, Myy+Mnn, Fyn, Fny, Fyy+Fnn are "large" then an asymptotic test might be justified.
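Here is a minimal sketch of how this might look in R. Several things in it are my assumptions rather than part of the answer: the counts are invented, I take rows to be the original form and columns the modified form, and the asymptotic test shown is a Wald-type z-test on the difference between the two changes in endorsement rate, which is only one possible choice.

```r
# Hypothetical counts, NOT from the study.
# Assumed layout: rows = original form, columns = modified form.
labs <- list(original = c("Yes", "No"), modified = c("Yes", "No"))
males   <- matrix(c(20,  5, 25, 150), nrow = 2, byrow = TRUE, dimnames = labs)
females <- matrix(c(60, 12, 15, 213), nrow = 2, byrow = TRUE, dimnames = labs)

# McNemar test within each sex (analog of the paired t-test)
mcnemar.test(males)
mcnemar.test(females)

# Change in endorsement rate and its estimated variance for one table.
# Each person contributes +1 (No -> Yes), -1 (Yes -> No), or 0 (no change).
change_stats <- function(tbl) {
  n      <- sum(tbl)
  gained <- tbl["No", "Yes"]
  lost   <- tbl["Yes", "No"]
  d <- (gained - lost) / n             # change in proportion answering "Yes"
  v <- ((gained + lost) / n - d^2) / n # variance of that mean change
  c(change = d, var = v)
}

sm <- change_stats(males)
sf <- change_stats(females)

# Wald-type z-test for the difference between the two changes
z <- unname((sm["change"] - sf["change"]) / sqrt(sm["var"] + sf["var"]))
p_value <- 2 * pnorm(-abs(z))
```

The normal approximation behind `z` is only reasonable when the discordant counts in both tables are fairly large, which is the condition stated above.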

Related Solutions

Using chi square when expected value is 0

Consider a die that has equal probabilities for its six faces. However, the faces are labeled 1, 1, 2, 3, 4, 5. So you have five possible outcomes with respective probabilities $p = (1/3, 1/6, 1/6, 1/6, 1/6).$ Your table will have 'categories' 1, 2, 3, 4, 5. You will ignore the category 6 that would have been possible with a standard die.

Example in R:

set.seed(2021)
x = sample(1:5, 600, replace = TRUE, prob = c(2,1,1,1,1)/6) # 600 simulated rolls
t = tabulate(x);  t
[1] 205 101 107 102  85                         # observed face counts
e = c(200, 100, 100, 100, 100);  e
[1] 200 100 100 100 100                         # expected counts; 600p

The chi-squared test has P-value $0.57 > 0.05 = 5\%,$ so the null hypothesis that categories have the probabilities $p$ is not rejected.

q = sum((t-e)^2/e);  q                  
[1] 2.915                         # chi-sq statistic
pv = 1 - pchisq(q, 4); pv
[1] 0.5721492                     # P-value

Similarly, in your study, just suppress the impossible categories. Degrees of freedom for the chi-squared statistic will be the number of remaining categories, minus one.
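As a cross-check on the hand computation above, the same goodness-of-fit test can be run with R's built-in `chisq.test`, passing the null probabilities through its `p` argument. This sketch reuses the observed counts printed above; the names `obs`, `p0`, and `gof` are mine.

```r
# Observed face counts from the simulation above
obs <- c(205, 101, 107, 102, 85)

# Null probabilities for the five possible categories (the impossible
# category 6 is simply left out)
p0 <- c(2, 1, 1, 1, 1) / 6

gof <- chisq.test(obs, p = p0)
gof$statistic  # chi-squared statistic
gof$parameter  # degrees of freedom: number of categories minus one
gof$p.value
```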

What test to use to find which proportion is highest in multiple groups of different sample sizes

After the significant chi-squared test on the four groups, you may want to do ad hoc tests comparing Group A with other groups. Because you did not show your actual data, I will show how to do this for fictitious data, which may be somewhat similar to yours.

Suppose you have a table of numbers of sales as follows:

     A    B    C    D
    ------------------  
 M  201  145  143  130
 F  170  152  148  211

In R:

m = c(201, 145, 143, 130)
f = c(170, 152, 148, 211)
TBL = rbind(m, f);  TBL

TBL
  [,1] [,2] [,3] [,4]
m  201  145  143  130
f  170  152  148  211

In R, a chi-squared test of homogeneity rejects the null hypothesis that the proportions of sales by gender are homogeneous across groups at the $0.1\%$ significance level, with P-value $0.0003 < 0.001 = 0.1\%.$ The Yates continuity correction is declined (parameter 'cor=F') on account of reasonably large counts. (For a table larger than 2 by 2, as here, R applies no continuity correction in any case.)

chisq.test(TBL, cor=F)

        Pearson's Chi-squared test

data:  TBL
X-squared = 19.168, df = 3, p-value = 0.0002523

The chi-squared test compares the observed counts in TBL with counts (based on marginal totals) that would be expected under the null hypothesis of homogeneity. The Pearson residuals can show where disagreement of observed and expected counts is greatest; look especially for residuals with largest absolute values. Here it seems that the greatest contribution to the relatively large chi-squared statistic comes from groups A and D.

chisq.test(TBL, cor=F)$resi
       [,1]       [,2]      [,3]      [,4]
m  1.831823  0.3012389  0.377127 -2.540219
f -1.746446 -0.2871989 -0.359550  2.421826
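To make explicit what these residuals are, here is a short sketch that recomputes them by hand as (observed - expected) / sqrt(expected), with expected counts formed from the marginal totals. The names `E` and `resid_by_hand` are mine.

```r
m <- c(201, 145, 143, 130)
f <- c(170, 152, 148, 211)
TBL <- rbind(m, f)

# Expected counts under homogeneity: (row total * column total) / grand total
E <- outer(rowSums(TBL), colSums(TBL)) / sum(TBL)

# Pearson residuals; their squares sum to the chi-squared statistic
resid_by_hand <- (TBL - E) / sqrt(E)
resid_by_hand
sum(resid_by_hand^2)
```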

You can do an ad hoc test to compare these two groups by selecting only columns 1 and 4 of TBL. In order to avoid 'false discovery' from repeated analyses on the same data, ad hoc tests should be conducted at a smaller significance level than the main chi-squared test. Here it is clear that Groups A and D differ. Specifically, numbers of sales by males are larger in A and sales by females are larger in D.

chisq.test(TBL[,c(1,4)], cor=F)

        Pearson's Chi-squared test

data:  TBL[, c(1, 4)]
X-squared = 18.41, df = 1, p-value = 1.781e-05
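One common way to run such follow-up tests at a stricter level is to adjust the P-values for the number of comparisons. The sketch below uses `pairwise.prop.test` with a Bonferroni adjustment over all six pairs of groups; that choice of method, and the group labels A to D attached to the counts, are my assumptions and not necessarily what you would want for your data.

```r
m <- c(A = 201, B = 145, C = 143, D = 130)  # sales by males
f <- c(A = 170, B = 152, C = 148, D = 211)  # sales by females

# All pairwise comparisons of the proportion of sales by males,
# Bonferroni-adjusted, without continuity correction
pw <- pairwise.prop.test(m, m + f, p.adjust.method = "bonferroni",
                         correct = FALSE)
pw$p.value  # lower-triangular matrix of adjusted P-values
```

The entry in row D, column A is the adjusted P-value for the A versus D comparison shown above; with six comparisons, Bonferroni multiplies each raw P-value by 6 (capped at 1).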

Note: Another, essentially equivalent, version of the chi-squared test in R is 'prop.test', used as follows. It shows the proportion of sales by males in each group. The proportions $0.542$ and $0.381$ were shown to be significantly different by the ad hoc chi-squared test above.

t = m+f
prop.test(m, t, cor=F)

        4-sample test for equality of proportions without continuity correction

data:  m out of t
X-squared = 19.168, df = 3, p-value = 0.0002523
alternative hypothesis: two.sided
sample estimates:
    prop 1    prop 2    prop 3    prop 4 
 0.5417790 0.4882155 0.4914089 0.3812317
