Statistical Significance – Testing Difference of Groups with Ordinal vs. Continuous Variable

continuous data, ordinal-data, statistical significance

This is a question about (1) whether to treat specific data as ordinal/interval or continuous, and (2) if the data should be treated as ordinal/interval, which test is best for determining whether a treatment and control group differ, and how they differ.

I have data from a treatment and a control sample about maize seed sales in Africa. 50 different stores were randomly assigned to a treatment or control group, and each store has approximately 200 clients purchasing maize. Sales are always made in multiples of 0.25 acres, between 0.25 and 2.0 acres. I want to know whether the treatment caused people purchasing maize in the stores to buy more or less maize.

In the past, I've always treated this as a continuous variable, calculated the average maize acres sold for each store, and compared the average maize acreage sold by treatment and control stores using a t-test. Like this:

avg_maize_sales_t <- c(
0.9724265, 0.9318182, 0.8770764, 0.8687500, 0.8556701, 0.8524590, 
0.8448276, 0.8342246, 0.8264249, 0.8257261, 0.7986322, 0.7986111, 0.7975207, 
0.7955882, 0.7833333, 0.7809798, 0.7798507, 0.7785235, 0.7745902, 0.7719780, 
0.7685874, 0.7671990, 0.7604167, 0.7500000, 0.7489712, 0.7465940, 0.7438080,
0.7282322, 0.7257143, 0.7220874, 0.7202381, 0.7079439, 0.7002165, 0.6882022, 
0.6858491, 0.6687307, 0.6553309, 0.6455882, 0.6437126, 0.6403162, 0.6360294, 
0.6305638, 0.6303630, 0.6294788, 0.6200397, 0.6077406, 0.6030220, 0.6021505, 
0.6011561, 0.5926641, 0.5916955, 0.5890288, 0.5667752, 0.5557185, 0.5200000)

avg_maize_sales_c <- c(
0.6783042, 0.6370482, 0.6330935, 0.6305970, 0.6126543, 0.6090116, 
0.6038851, 0.5965517, 0.5925481, 0.5849057, 0.5771812, 0.5727848, 0.5689046, 
0.5681090, 0.5638889, 0.5365385, 0.5258380, 0.5254065, 0.5245098, 0.5238095, 
0.5230769, 0.5181818, 0.5171875, 0.5125786, 0.5089744, 0.5053763, 0.5022936,
0.5012019, 0.5010593, 0.4735169, 0.4696429, 0.4695946, 0.4669312, 0.4605263, 
0.4595436, 0.4587500, 0.4581545, 0.4561404, 0.4528443, 0.4513889, 0.4504717, 
0.4457547, 0.4453125, 0.4434783, 0.4434629, 0.4381188, 0.4375000, 0.4251269, 
0.4150943, 0.3989899, 0.3750000, 0.3692893, 0.3630573, 0.3449612)

t.test(avg_maize_sales_t, avg_maize_sales_c)

Welch Two Sample t-test

data:  t and c
t = 12.764, df = 100.46, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1860005 0.2544595
sample estimates:
mean of x mean of y 
0.7226032 0.5023732 

I'm starting to think that this is not the appropriate way to compare the two groups, since maize sales is not a true continuous variable.

The data below show the number of sales at each maize acreage amount in each group.

treat <- data.frame(
  maize_sale = c(0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2),
  count      = c(2279, 5759, 703, 3358, 846, 462, 27, 490)
)

cont <- data.frame(
  maize_sale = c(0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2),
  count      = c(2738, 5172, 728, 1516, 268, 120, 10, 104)
)

Question: Is the method I've been using (averaging acreage per store and doing a t-test) appropriate for comparing the difference in acres sold, or should I treat the variable as ordinal/interval and use a different statistical test to see whether the treatment resulted in more or fewer maize acres sold per client?

Best Answer

The ordinal/interval distinction matters only with respect to whether you consider mean differences meaningful in your theoretical context. If you do, then a t test of differences between means is fine (given its assumptions). The test makes distributional assumptions but no assumptions about the level of measurement; as I said, the level of measurement only has implications for how you interpret a mean difference.
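If a mean difference is meaningful here (an acre is an acre at every purchase size), it can be read straight off the count tables in the question. The following is only a rough sketch: it works at the level of individual sales, whereas the randomization was by store, so it describes the data and is not a substitute for the store-level test. The names `mean_t`, `mean_c`, `mean_diff`, and `shares` are made up for this illustration.

```r
# Count tables copied from the question: one row per purchase size (acres).
treat <- data.frame(
  maize_sale = c(0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2),
  count      = c(2279, 5759, 703, 3358, 846, 462, 27, 490)
)
cont <- data.frame(
  maize_sale = c(0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2),
  count      = c(2738, 5172, 728, 1516, 268, 120, 10, 104)
)

# Mean acres per sale, weighting each purchase size by how often it occurred.
mean_t    <- weighted.mean(treat$maize_sale, treat$count)
mean_c    <- weighted.mean(cont$maize_sale, cont$count)
mean_diff <- mean_t - mean_c

# Share of sales at each purchase size, to show where the two groups differ.
shares <- data.frame(
  maize_sale = treat$maize_sale,
  treatment  = round(treat$count / sum(treat$count), 3),
  control    = round(cont$count / sum(cont$count), 3)
)
print(shares)
cat("mean acres per sale: treatment", round(mean_t, 3),
    "control", round(mean_c, 3), "\n")
```

The table of shares is often the more useful output: it shows which purchase sizes gained or lost sales, which a single mean difference cannot.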

Ordinal/interval and continuous are not opposing categories. Continuous variables can be ordinal, interval, or neither. The relevant distinction is between continuous and discrete variables. The t test does assume normal distributions, which implies a continuous variable, but violating that assumption by having a discrete variable will not cause a problem unless the distribution is extremely non-normal.
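One way to check this claim for data like yours is a small simulation in which the null hypothesis is true by construction. The sketch below draws every sale from the same eight-point discrete distribution (the control shares from the question), averages within stores, and counts how often the t test rejects at the 5% level. The design constants (25 stores per arm, 200 clients per store, 2000 simulated experiments) and the seed are my assumptions, and the simulation ignores real differences between stores, so treat it as an illustration only.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Purchase sizes and control-group counts from the question's table.
sizes  <- c(0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2)
counts <- c(2738, 5172, 728, 1516, 268, 120, 10, 104)
probs  <- counts / sum(counts)

# Hypothetical design constants (assumptions, loosely based on the question).
n_stores_per_arm  <- 25
clients_per_store <- 200
n_sims            <- 2000

# Average acres per sale for one simulated store.
sim_store_mean <- function() {
  mean(sample(sizes, clients_per_store, replace = TRUE, prob = probs))
}

# One simulated experiment in which the null hypothesis is true:
# both arms draw from the same discrete distribution.
sim_p_value <- function() {
  arm_a <- replicate(n_stores_per_arm, sim_store_mean())
  arm_b <- replicate(n_stores_per_arm, sim_store_mean())
  t.test(arm_a, arm_b)$p.value
}

p_values <- replicate(n_sims, sim_p_value())

# Share of null experiments that the t test wrongly flags at the 5% level.
false_positive_rate <- mean(p_values < 0.05)
false_positive_rate
```

If the discreteness were a problem, the rejection rate would drift well away from 0.05. Averaging 200 sales per store makes the store means close to normal, which is why the store-level t test tolerates the discrete raw data.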

With a p value as low as yours, you don't have to be concerned at all that the population difference is 0. Even extreme violations of assumptions will not produce a p value that low with any nontrivial probability if the null hypothesis is true.
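If you want a check that does not lean on normality at all, a permutation test on the store averages is one option. This is my suggestion, not something you need for the conclusion to hold. The sketch below shuffles the treatment/control labels across the store means from the question and asks how often a shuffled difference is at least as large as the observed one. The number of permutations, the seed, and the names `observed`, `perm_diffs`, and `p_perm` are arbitrary choices for this illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Store-level averages copied from the question.
avg_maize_sales_t <- c(
  0.9724265, 0.9318182, 0.8770764, 0.8687500, 0.8556701, 0.8524590,
  0.8448276, 0.8342246, 0.8264249, 0.8257261, 0.7986322, 0.7986111, 0.7975207,
  0.7955882, 0.7833333, 0.7809798, 0.7798507, 0.7785235, 0.7745902, 0.7719780,
  0.7685874, 0.7671990, 0.7604167, 0.7500000, 0.7489712, 0.7465940, 0.7438080,
  0.7282322, 0.7257143, 0.7220874, 0.7202381, 0.7079439, 0.7002165, 0.6882022,
  0.6858491, 0.6687307, 0.6553309, 0.6455882, 0.6437126, 0.6403162, 0.6360294,
  0.6305638, 0.6303630, 0.6294788, 0.6200397, 0.6077406, 0.6030220, 0.6021505,
  0.6011561, 0.5926641, 0.5916955, 0.5890288, 0.5667752, 0.5557185, 0.5200000)

avg_maize_sales_c <- c(
  0.6783042, 0.6370482, 0.6330935, 0.6305970, 0.6126543, 0.6090116,
  0.6038851, 0.5965517, 0.5925481, 0.5849057, 0.5771812, 0.5727848, 0.5689046,
  0.5681090, 0.5638889, 0.5365385, 0.5258380, 0.5254065, 0.5245098, 0.5238095,
  0.5230769, 0.5181818, 0.5171875, 0.5125786, 0.5089744, 0.5053763, 0.5022936,
  0.5012019, 0.5010593, 0.4735169, 0.4696429, 0.4695946, 0.4669312, 0.4605263,
  0.4595436, 0.4587500, 0.4581545, 0.4561404, 0.4528443, 0.4513889, 0.4504717,
  0.4457547, 0.4453125, 0.4434783, 0.4434629, 0.4381188, 0.4375000, 0.4251269,
  0.4150943, 0.3989899, 0.3750000, 0.3692893, 0.3630573, 0.3449612)

# Observed difference in mean acres per sale between the two groups of stores.
observed <- mean(avg_maize_sales_t) - mean(avg_maize_sales_c)

pooled <- c(avg_maize_sales_t, avg_maize_sales_c)
n_t    <- length(avg_maize_sales_t)
n_perm <- 10000

# Randomly reassign which stores count as "treatment" and recompute the difference.
perm_diffs <- replicate(n_perm, {
  idx <- sample(length(pooled), n_t)
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided p value; the +1 terms keep the estimate from being exactly zero.
p_perm <- (sum(abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)
p_perm
```

The smallest p value this procedure can report is 1 / (n_perm + 1), so it cannot reproduce a figure like 2.2e-16; it can only confirm that the shuffled differences essentially never reach the observed one.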
