Solved – How to test for significance if groups differed at baseline

change-scores, hypothesis-testing, logistic, t-test

I have a situation where I want to test for a statistically significant difference between two groups in the percentage of customers who own a product. The treated group gets a marketing message, and the control group does not. However, these two groups were used for a number of different studies, and because each group was large (~150,000 customers in TREATED and ~50,000 in CONTROL), we found out afterward that the starting populations had a statistically significant difference in the percentage of customers who owned the product.

What kind of statistical significance test can I do to rectify this situation?

Example data:

TREATED (received marketing; product ownership measured before and after the test was conducted)

  • N = 150,000
  • % that own product before marketing = 55.0%
  • % that own the product after marketing = 57.5%

CONTROL (did not receive marketing, but we measured product ownership before and after the test was conducted as well)

  • N = 50,000
  • % that own product before = 53.2% (statistically different from TREATED's 55.0%)
  • % that own the product after = 56.2%

In this scenario, the CONTROL group actually has a larger increase in the percentage of customers who own the product, but if I simply compared the two AFTER percentages, the test would show a statistically significant result (150,000 customers at 57.5% is statistically different from 50,000 customers at 56.2% if you just plug those numbers into a t-test).

Is there a paired t-test for two groups, or something similar, that can account for the initial difference and answer whether the marketing had a statistically significant impact on product ownership?

Best Answer

Prelude:
You should not be using any kind of t-test here, or a related linear model. Your response data are Bernoulli (1/0, owns / does not own the product). For a simple test of two rates without adjusting for covariates, you would use a z-test for the difference of two proportions (see here); for the paired version, you would use McNemar's test (see here & here). Since you have covariates (in one form or another), you won't be able to use either of these, but I mention them for future reference; a sketch of both follows.
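For reference, here is a minimal sketch of those two tests using statsmodels. The proportion counts are back-calculated from the question's after-marketing figures; the McNemar table is hypothetical, since the question only gives the margins (the TREATED group's 55.0% before and 57.5% after), not the paired before/after cell counts the test needs.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.contingency_tables import mcnemar

# z-test for a difference of two proportions (unpaired, no covariates):
owners = np.array([86_250, 28_100])   # 57.5% of 150,000 and 56.2% of 50,000
n_obs = np.array([150_000, 50_000])
z_stat, p_value = proportions_ztest(owners, n_obs)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")

# McNemar's test for paired binary data (before vs. after within one group).
# Hypothetical cells: rows = own before (yes/no), cols = own after (yes/no);
# only the margins (82,500 before, 86,250 after, N = 150,000) match the question.
table = np.array([[80_000,  2_500],
                  [ 6_250, 61_250]])
print(mcnemar(table, exact=False))
```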


Since the groups were initially formed by random assignment, your situation is analogous to people running randomized clinical trials and wanting to test for differences in covariates at baseline. That is a very common impulse, and it is very intuitive; however, it is incorrect. Given that the groups were randomly assigned, you should not test for differences at baseline. The fact that the groups differed is simply a type I error. Indeed, it is not logically possible for it to be anything else.

That said, given what we know about the customers' baseline status, there is a strong argument that ignoring that information and concluding that the marketing message boosted sales could also be a type I error (or perhaps a kind of type III error). Clearly, we should take the baseline information into account (even though we should not test the groups' baseline rates against each other). The question then is: how do we properly incorporate the baseline information into the test? Given that we want to use additional information, the z-test mentioned above will be insufficient; some form of logistic regression will be appropriate. There are several possibilities:

  1. (We can consider what you could do if you had normally distributed data as an analogy and then move to recommendations that are more specific to your situation.) With just two data points per unit (i.e., pre and post), a simple option is to use change scores: you subtract the before score from the after score and use that as your response variable. That exact procedure doesn't work with binary data, but an analog might be to use conditional logistic regression. That is like stratifying your sample by their before status and applying logistic regression (see the first sketch after this list).

  2. Another possibility, if you had normal data, would be to use a traditional ANCOVA. That is, you would look for differences in after values by group, controlling for the before values as a covariate. With binary data, this is just a multiple logistic regression with group and before as variables (see the second sketch below).

    Over the years, there have been endless debates about whether change scores or ANCOVA is better (it may be worth your time to read through Best practice when analysing pre-post treatment-control designs). If your units (customers) are not equivalent at baseline, the ANCOVA approach can be misleading (a phenomenon known as Lord's paradox), but ANCOVA is typically more powerful. So the general recommendation is to use ANCOVA when your units were randomly assigned, but to use change scores with observational data. Your case is unusual in that your groups were randomly assigned, but nonetheless appear to differ. I suppose my suggestion would be to decide if you believe there was a failure of randomization for some detectable reason, or if the test is a type I error, and then use the appropriate model. Personally, I would be very likely to use the ANCOVA approach unless there was a very strong argument that something went wrong with the randomization.

  3. You could also think of your situation as repeated measures (although the case for this would be stronger if you had more than two measurements per person, in my opinion). If you wanted to think about your situation in this way, there are two options: You could use a GLMM, which models the probability of owning the product after the marketing message conditional on each individual's covariate values, OR

  4. You could use a GEE, which models the population mean proportions (see the GEE sketch after this list). Understanding this distinction is tricky; it may help you to read my answer here: Difference between generalized linear models & generalized linear mixed models in SPSS.
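To make point 1 concrete, here is a minimal sketch of a conditional logistic regression stratified on baseline ownership, using statsmodels' ConditionalLogit on simulated data (the variable names and data-generating rates are hypothetical, loosely echoing the question's numbers):

```python
import numpy as np
import pandas as pd
from statsmodels.discrete.conditional_models import ConditionalLogit

rng = np.random.default_rng(0)
n = 300  # kept small: the conditional likelihood is costly for large strata

group = rng.integers(0, 2, n)               # 1 = treated, 0 = control
before = rng.binomial(1, 0.54, n)           # baseline ownership -> strata
after = rng.binomial(1, 0.50 + 0.05 * before + 0.02 * group)

# Condition out the baseline strata; only the group effect is estimated.
res = ConditionalLogit(after, pd.DataFrame({"group": group}),
                       groups=before).fit()
print(res.summary())
```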
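Point 2's binary-data ANCOVA is just a multiple logistic regression of the after status on group plus the before status. A minimal sketch, again on hypothetical simulated data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 10_000
df = pd.DataFrame({"group": rng.integers(0, 2, n),
                   "before": rng.binomial(1, 0.54, n)})
df["after"] = rng.binomial(1, 0.50 + 0.05 * df["before"] + 0.02 * df["group"])

# 'group' is the marketing effect adjusted for baseline ownership.
fit = smf.logit("after ~ group + before", data=df).fit()
print(fit.summary())
```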
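Finally, a minimal GEE sketch for point 4, with the data reshaped to long format (two rows per customer); the group-by-time interaction is the population-average marketing effect. Everything here is hypothetical simulated data. (For point 3's GLMM, one statsmodels option is BinomialBayesMixedGLM with a per-customer random intercept, which I won't sketch here.)

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 5_000
wide = pd.DataFrame({"customer": np.arange(n),
                     "group": rng.integers(0, 2, n)})
wide["own_before"] = rng.binomial(1, 0.54, n)
wide["own_after"] = rng.binomial(
    1, 0.50 + 0.05 * wide["own_before"] + 0.02 * wide["group"])

# Reshape to long format: one row per customer per time point.
long = wide.melt(id_vars=["customer", "group"],
                 value_vars=["own_before", "own_after"],
                 var_name="time", value_name="own")
long["time"] = (long["time"] == "own_after").astype(int)  # 0 = before, 1 = after

# 'group:time' tests whether TREATED rose more than CONTROL.
gee = smf.gee("own ~ group * time", groups="customer", data=long,
              family=sm.families.Binomial(),
              cov_struct=sm.cov_struct.Exchangeable()).fit()
print(gee.summary())
```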
