Analyze a dataset with unequal treatment distribution

biostatistics, hypothesis testing, inference, mixed model, regression

I’m looking for the best way to analyze a data set. The experiment was conducted at two locations, X and Y. At location X it was run in a single year (2014); at location Y it was run in two consecutive years (2015 and 2016). The study had four treatments in total: A, B, C and D. The same three treatments (A, B, C) were compared at location Y in both years. Location X did not have treatment A, but had an additional treatment, D.
In summary, only treatments B and C were common to both locations. Treatment A was tested only at location Y, and treatment D only at location X.
What’s the best way to analyze this data?

  1. Should the data be analyzed separately for each location? For location X this seems better, because I can compare treatments B, C and D without introducing any bias: treatment D was not tested at location Y, so I would have the same number of data points for each of B, C and D. It also seems better for location Y, because the same treatments were compared there and the experiment was repeated the following year. Analyzing each location separately means comparing a similar number of data points for each treatment.

  2. Since treatments B and C were common to both locations, should the data instead be analyzed together, so that I get more data points for comparing B and C? The problem is that the number of data points per treatment then becomes uneven, because treatment D was only tested at location X: treatment D has only 23 data points, treatments B and C have 74 each (23 + 25 + 26), and treatment A has 51 (26 + 25).
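The pooled counts in option 2 can be checked with a small tally. This is only a sketch of the arithmetic: it assumes every treatment within a location-year has the same number of plants, and the question does not say which of the two years at Y had 25 plants and which had 26, so that assignment is a guess that does not affect the totals.

```python
# Per-cell sample sizes implied by the sums quoted in the question.
# ASSUMPTION: 25 plants per treatment in 2015 and 26 in 2016; the question
# does not say which year had which count. The pooled totals are unaffected.
cell_n = {
    ("X", 2014, "B"): 23, ("X", 2014, "C"): 23, ("X", 2014, "D"): 23,
    ("Y", 2015, "A"): 25, ("Y", 2015, "B"): 25, ("Y", 2015, "C"): 25,
    ("Y", 2016, "A"): 26, ("Y", 2016, "B"): 26, ("Y", 2016, "C"): 26,
}


def pooled_n(cells):
    """Sum the per-cell counts over locations and years for each treatment."""
    totals = {}
    for (_location, _year, treatment), n in cells.items():
        totals[treatment] = totals.get(treatment, 0) + n
    return dict(sorted(totals.items()))


totals = pooled_n(cell_n)
print(totals)  # {'A': 51, 'B': 74, 'C': 74, 'D': 23}
```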

I'm more inclined towards method 1 but looking for an expert opinion for confirmation. Thanks


Some details about my experiment: I put plants out in the field for a week to expose them to the four treatments above, took them back to a controlled environment to count the number of infected leaves, and then DISCARDED them. I repeated the experiment with fresh plants the following week. This is not time-series data. Treatments were applied for a week at both locations, so the duration of each treatment was the same. The response is a positive count. Plot size, treatment duration and sample material were identical. The only problem is the uneven distribution of treatments between 2014 and 2015-2016.

Best Answer

The different numbers of cases under different treatments isn't a problem on its own. Although there's a long historical preference in ANOVA for having equal numbers of cases in each treatment, a proper regression model will handle different numbers of cases.

The bigger problem is that you can't distinguish certain types of effects. For example, you can't distinguish differences between locations X and Y from differences between the calendar year 2014 versus years 2015-2016.
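A quick way to see this confounding is to list which location-year combinations the design actually contains. The sketch below is only illustrative and uses nothing beyond the three combinations described in the question; the variable names are made up for the example.

```python
from collections import defaultdict

# The only location/year combinations that exist in this design.
design = [("X", 2014), ("Y", 2015), ("Y", 2016)]

locations_by_year = defaultdict(set)
years_by_location = defaultdict(set)
for location, year in design:
    locations_by_year[year].add(location)
    years_by_location[location].add(year)

# A location effect can be separated from a year effect only if at least
# one year was observed at both locations.
years_with_both_locations = [
    year for year, locations in sorted(locations_by_year.items())
    if len(locations) > 1
]
print(years_with_both_locations)  # []

# By contrast, a year effect *within* location Y is estimable, because Y
# was observed in two different years.
print(sorted(years_by_location["Y"]))  # [2015, 2016]
```

Because no year contains both locations, any X versus Y difference is indistinguishable from a 2014 versus 2015-2016 difference.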

One way to deal with this is to model all of the data together and then do post-modeling comparisons of specific situations. That uses all of the data to build the model, but restricts subsequent analysis to comparisons that make sense given the limits of your experimental design.

Code each combination of location/treatment/year in a single 9-level categorical predictor: treatments B, C, D at X in 2014; treatments A, B, C at Y in 2015; treatments A, B, C at Y in 2016. You presumably will be using a Poisson or negative binomial model for the count data, and you should also be controlling for week within the year by including that in your model (for example, with a flexible regression spline). You might consider an interaction between your 9-level categorical predictor and your modeling of week within the year.
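As a rough illustration of the coding step only (not the model fit), the 9-level predictor could be built as below. The record field names (`location`, `year`, `treatment`, `week`, `infected_leaves`) and the example rows are hypothetical, invented to show the shape of the data; the resulting `cell` column is what would be passed as a categorical predictor to a Poisson or negative binomial regression.

```python
# Treatments tested in each location-year, as described in the question.
TESTED = {("X", 2014): "BCD", ("Y", 2015): "ABC", ("Y", 2016): "ABC"}


def cell_label(location, year, treatment):
    """Combine location, year and treatment into one categorical level."""
    return f"{location}_{year}_{treatment}"


# The nine levels that exist in the design (not the full 2 x 3 x 4 crossing).
levels = [
    cell_label(location, year, treatment)
    for (location, year), treatments in TESTED.items()
    for treatment in treatments
]
print(len(levels))  # 9

# HYPOTHETICAL records: field names and values are placeholders, not data
# from the study.
records = [
    {"location": "X", "year": 2014, "treatment": "D", "week": 1, "infected_leaves": 0},
    {"location": "Y", "year": 2016, "treatment": "A", "week": 2, "infected_leaves": 0},
]
for record in records:
    record["cell"] = cell_label(record["location"], record["year"], record["treatment"])
    # Guard against combinations that were never run (e.g. treatment A at X).
    if record["cell"] not in levels:
        raise ValueError(f"combination not in the design: {record['cell']}")

print([record["cell"] for record in records])  # ['X_2014_D', 'Y_2016_A']
```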

After you have fit the model, don't pay much attention to the individual coefficient values. Instead, use post-modeling tools like those provided by the R emmeans package to evaluate combinations of coefficients that provide useful information. For example, did the overall 2015 and 2016 estimates of treatment effects at location Y differ? What is the difference between treatments B and C, combining both locations? Between D and B/C at location X? Between A and B/C at location Y?
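To make "combinations of coefficients" concrete, here is a minimal Python sketch of the contrast weights behind three of those questions, defined over the nine cells. It is not emmeans and it fits no model; it only shows which cells each comparison averages. Equal weighting of the cells being averaged is an assumption of this sketch, and the helper names (`select`, `contrast`, `estimate`) are invented for the example.

```python
from fractions import Fraction

# The nine location/year/treatment cells in the design.
CELLS = [("X", 2014, t) for t in "BCD"] + [
    ("Y", year, t) for year in (2015, 2016) for t in "ABC"
]


def select(location=None, year=None, treatments=None):
    """Return the cells matching the given location, year and treatments."""
    return [
        cell for cell in CELLS
        if (location is None or cell[0] == location)
        and (year is None or cell[1] == year)
        and (treatments is None or cell[2] in treatments)
    ]


def contrast(plus, minus):
    """Weights for mean(plus cells) - mean(minus cells), equally weighted."""
    weights = {cell: Fraction(0) for cell in CELLS}
    for cell in plus:
        weights[cell] += Fraction(1, len(plus))
    for cell in minus:
        weights[cell] -= Fraction(1, len(minus))
    return weights


contrasts = {
    "B vs C, both locations": contrast(select(treatments="B"), select(treatments="C")),
    "D vs B/C at X": contrast(select("X", treatments="D"), select("X", treatments="BC")),
    "A vs B/C at Y": contrast(select("Y", treatments="A"), select("Y", treatments="BC")),
}

for name, weights in contrasts.items():
    # A valid contrast has weights that sum to zero.
    assert sum(weights.values()) == 0
    print(name, {cell: str(w) for cell, w in weights.items() if w != 0})


def estimate(weights, cell_estimates):
    """Apply contrast weights to per-cell estimates on the model's link scale."""
    return sum(weights[cell] * cell_estimates[cell] for cell in CELLS)
```

With a log link, `estimate` applied to the fitted per-cell log means gives a log rate ratio, so exponentiating it gives the ratio of expected counts between the two groups of cells.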
