You could test your Anderson-Darling code using data generated from an external library. However, you then run into the issue of how to test/trust the external library. At some point you have to trust that well-established libraries are error-free and that their output can be relied on.
Once you have tested the Anderson-Darling code against data generated from an external library, the circularity is broken and you can rely on your own code if it passes the Anderson-Darling tests. The same holds for the K-S (I presume Kolmogorov–Smirnov) test.
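As a minimal sketch of this idea: generate samples with a trusted library (numpy here) and check your implementation against them. I use `scipy.stats.anderson` as a stand-in for the implementation under test; swap in your own A-D function to break the circularity as described above.

```python
import numpy as np
from scipy import stats

# Generate "known good" data from a trusted external library.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)

# Stand-in for your own Anderson-Darling implementation.
result = stats.anderson(sample, dist="norm")
print("A-D statistic:", result.statistic)
print("critical values:", result.critical_values)  # at 15, 10, 5, 2.5, 1 percent
# For data truly drawn from a normal distribution, the statistic should
# fall below the 5% critical value in roughly 95% of simulated samples.
```

Repeating this over many simulated samples (and over non-normal alternatives, where the statistic should exceed the critical values) gives a reasonable check of both size and power.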
If the treatment is randomly assigned, the aggregation won't matter in determining the effect of the treatment (or the average treatment effect). I use lowercase in the following examples to refer to disaggregated items and uppercase to refer to aggregated items. Let's a priori state a model of individual decision making, where $y$ is the outcome of interest and $x$ indicates whether an observation received the treatment:
$y = \alpha + b_1(x) + b_2(z) + e$
When one aggregates, one is simply summing random variables, so one would observe:
$\sum y = \sum\alpha + \beta_1(\sum x) + \beta_2(\sum z) + \sum e$
So why should $\beta_1$ (divided by the total number of elements summed, $n$) equal $b_1$? Because, by the nature of random assignment, all of the individual components of $x$ are orthogonal (i.e. the variance of $\sum x$ is simply the sum of the individual variances), and all of the individual components are uncorrelated with any of the $z$'s or $e$'s in the above equation.
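The orthogonality claim is easy to check by simulation. The sketch below draws independent Bernoulli treatment assignments (a simple assumed setup) and confirms that the variance of the sum matches the sum of the individual variances:

```python
import numpy as np

# Under random assignment the individual treatment indicators are
# independent, so Var(sum x) should equal the sum of the variances.
rng = np.random.default_rng(1)
n_units, n_reps = 2, 100_000
x = rng.binomial(1, 0.5, size=(n_reps, n_units))  # independent assignments

var_of_sum = x.sum(axis=1).var()       # variance of the aggregated variable
sum_of_vars = x.var(axis=0).sum()      # sum of the individual variances
print(var_of_sum, sum_of_vars)         # both close to 2 * 0.25 = 0.5
```

If the assignments were correlated (e.g. treated units clustered on certain days), the two quantities would differ by twice the covariance term.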
Perhaps an example of summing two random variables will be more informative. Say we have a case where we aggregate two random variables from the first equation presented, so what we observe is:
$(y_i + y_j) = (\alpha_1 + \alpha_2) + \beta_1(x_i + x_j) + \beta_2(z_i + z_j) + (e_1 + e_2)$
This can subsequently be broken down into its individual components:
$(y_i + y_j) = \alpha_1 + \alpha_2 + b_1(x_i) + b_2(x_j) + b_3(z_i) + b_4(z_j) + e_1 + e_2$
By the nature of random assignment, we expect $x_i$ and $x_j$ in the above statement to be independent of all the other parameters ($z_i$, $z_j$, $e_1$, etc.) and of each other. Hence the effect in the aggregated data equals the effect in the disaggregated data (i.e. $\beta_1$ equals the sum of $b_1$ and $b_2$ divided by two in this case).
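A quick simulation illustrates this point. The coefficient values below are assumptions for illustration only: with randomly assigned $x$, an OLS fit on pair-aggregated data recovers the individual-level $b_1$.

```python
import numpy as np

# Assumed individual-level model: y = alpha + b1*x + b2*z + e.
rng = np.random.default_rng(2)
n = 100_000                        # number of aggregated pairs
alpha, b1, b2 = 0.5, 2.0, -1.0

x = rng.binomial(1, 0.5, size=(n, 2))   # random assignment within each pair
z = rng.normal(size=(n, 2))
e = rng.normal(size=(n, 2))
y = alpha + b1 * x + b2 * z + e

# Aggregate each pair, then fit OLS via least squares.
Y, X, Z = y.sum(axis=1), x.sum(axis=1), z.sum(axis=1)
design = np.column_stack([np.ones(n), X, Z])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(coef[1])   # close to b1 = 2.0
```

Because the aggregated model is exactly linear in the aggregated regressors, the estimate is unbiased; aggregation only changes the error variance, not the coefficient.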
This exercise is informative, though, for seeing where aggregation bias comes into play. Any time the components of the aggregated variable are not independent of the other components, you are creating an inherent confound in the analysis (i.e. you cannot independently identify the effects of each individual item). So, going with your "blue day" scenario, one might have a model of individual behavior:
$y = \alpha + b_1(x) + \beta_2(Z) + b_3(x*Z) + e$
where $Z$ refers to whether the observation was taken on a blue day and $x*Z$ is the interaction of the treatment with it being a blue day. It should be fairly obvious why taking all of your observations on one day would be problematic. If treatment is randomly assigned, $b_1(x)$ and $\beta_2(Z)$ should be independent, but $b_1(x)$ and $b_3(x*Z)$ are not. Hence you will not be able to uniquely identify $b_1$, and the research design is inherently confounded.
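The confound can be made concrete by checking the rank of the design matrix in this (assumed) worst case: if every observation is taken on the blue day, $Z$ is constant and $x*Z$ coincides with $x$, so OLS has no unique solution for $b_1$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.binomial(1, 0.5, size=n)
Z = np.ones(n)                     # all observations taken on blue day

# Columns: intercept, x, Z, x*Z.  With Z constant, Z duplicates the
# intercept and x*Z duplicates x, so the matrix has rank 2, not 4.
design = np.column_stack([np.ones(n), x, Z, x * Z])
print(np.linalg.matrix_rank(design))
```

Spreading observations across days restores variation in $Z$ (and in $x*Z$), making the design matrix full rank and $b_1$ identifiable.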
You could potentially make a case for doing the data analysis on the aggregated items (aggregated values tend to be easier to work with, less noisy, and easier to find correlations in, and they tend to follow distributions that are easier to model). But if the real question is to identify $b_1(x)$, then the research design should be structured to identify it appropriately. While I made an argument above for why aggregation does not matter in a randomized experiment, in many settings the assumption that all of the individual components are independent is violated. If you expect specific effects on specific days, aggregating the observations will not help you identify the treatment effect (it is actually a good argument to prolong the observations to make sure no inherent confounds are present).
Really, these approaches have not been actively developed for a very long time. For univariate outliers, the optimal (most efficient) filter is median $\pm\delta\times$ MAD, or better yet (if you have access to R) median $\pm\delta\times$ Qn (so you don't assume the underlying distribution to be symmetric).
The Qn estimator is implemented in package robustbase.
See:
Rousseeuw, P. J. and Croux, C. (1993). Alternatives to the Median Absolute Deviation. *Journal of the American Statistical Association*, 88, 1273–1283.
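The median $\pm\delta\times$ MAD filter is a few lines of Python. Here $\delta = 3$ is an assumed common cutoff, and I use `scipy.stats.median_abs_deviation` with `scale="normal"` to rescale the MAD so it is consistent for Gaussian data; a Qn-style estimator would replace only the `spread` line.

```python
import numpy as np
from scipy import stats

def mad_outliers(x, delta=3.0):
    """Flag points outside median +/- delta * MAD (normal-consistent)."""
    x = np.asarray(x, dtype=float)
    center = np.median(x)
    spread = stats.median_abs_deviation(x, scale="normal")
    return np.abs(x - center) > delta * spread

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 35.0])
print(mad_outliers(data))   # only the last point is flagged
```

Because both the median and the MAD have a 50% breakdown point, this filter keeps working even when a large fraction of the data is contaminated, unlike mean/s.d.-based rules.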
Response to comment:
Two levels.
A) Philosophical.
Both the Dixon and Grubbs tests are only able to detect a particular type of (isolated, single) outlier. Over the last 20-30 years the concept of an outlier has evolved into "any observation that departs from the main body of the data", without further specification of what the particular departure is. This characterization-free approach renders the idea of building tests to detect outliers void. The emphasis shifted to the concept of estimators (a classical example of which is the median) that retain their values (i.e. are insensitive) even under a large rate of contamination by outliers (such an estimator is then said to be robust), and the question of detecting outliers becomes moot.
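A tiny simulation makes the robustness point concrete: contaminating 20% of a sample with gross outliers wrecks the mean but barely moves the median, which is why the field shifted from detection tests to robust estimators.

```python
import numpy as np

rng = np.random.default_rng(4)
clean = rng.normal(loc=0.0, scale=1.0, size=100)
contaminated = clean.copy()
contaminated[:20] = 1000.0         # 20% gross contamination

print(np.mean(clean), np.mean(contaminated))      # mean breaks down
print(np.median(clean), np.median(contaminated))  # median barely moves
```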
B) Weakness.
You can see that the Grubbs and Dixon tests easily break down: one can easily generate contaminated data that passes either test (i.e. without breaking the null). This is particularly obvious for the Grubbs test, because outliers break down the mean and s.d. used in the construction of the test statistic. It is less obvious for the Dixon test, until one learns that order statistics are not robust to outliers either.
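The masking effect in the Grubbs test can be sketched as follows. This is an illustrative simulation, not a reference implementation: one gross outlier is flagged, but duplicating that same outlier inflates the sample s.d. enough that the test statistic drops below the critical value and neither point is flagged.

```python
import numpy as np
from scipy import stats

def grubbs_statistic(x):
    """G = max |x_i - mean| / s, the two-sided Grubbs test statistic."""
    x = np.asarray(x, dtype=float)
    return np.max(np.abs(x - x.mean())) / x.std(ddof=1)

def grubbs_critical(n, alpha=0.05):
    """Standard two-sided Grubbs critical value based on the t distribution."""
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

rng = np.random.default_rng(5)
base = rng.normal(size=10)
one = np.append(base, 10.0)          # single gross outlier: flagged
two = np.append(base, [10.0, 10.0])  # duplicated outlier: masked

for sample in (one, two):
    print(grubbs_statistic(sample) > grubbs_critical(len(sample)))
```

The second print is the breakdown in action: the two outliers contribute so much to the mean and s.d. that each one looks unremarkable relative to the inflated scale.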
I think you will find more explanation of these facts in papers oriented towards a general non-statistician audience, such as the one cited above (I can also think of the Fast-MCD paper by Rousseeuw). If you consult any recent book or introduction to robust analysis, you will notice that neither Grubbs nor Dixon is mentioned.