Regression Analysis – Creating Anscombe-like Datasets with the Same Box and Whiskers Plot

Tags: curve fitting, descriptive statistics, estimators, regression

EDIT: As this question has grown, a summary: finding different meaningful and interpretable datasets with the same mixed statistics (mean, median, midrange and their associated dispersions, and regression).

The Anscombe quartet (see Purpose of visualizing high dimensional data?) is a famous example of four $x$-$y$ datasets with the same marginal means and standard deviations (on the four $x$ and the four $y$, separately) and the same OLS linear fit: regression coefficients, residual sum of squares, and coefficient of determination $R^2$. The $\ell_2$-type statistics (marginal and joint) are thus the same, while the datasets are quite different.

Anscombe's quartet

EDIT (from OP comments) Leaving the small dataset size apart, let me propose some interpretations. Set 1 can be seen as a standard linear (affine, to be correct) relationship with distributed noise. Set 2 shows a clean relationship that would be better captured by a higher-degree fit. Set 3 shows a clear linear statistical dependence with one outlier. Set 4 is trickier: the attempt to "predict" $y$ from $x$ seems bound to fail. The design of $x$ may reveal a hysteresis phenomenon with an insufficient range of values, a quantization effect (the $x$ could be quantized too heavily), or the user may have switched the dependent and independent variables.

So $\ell_2$ summary features hide very different behaviors. Set 2 would be better dealt with by a polynomial fit; Set 3, and likewise Set 4, by outlier-resistant methods ($\ell_1$ or the like). One might wonder whether other cost functions or discrepancy indicators could settle, or at least improve, dataset discrimination. EDIT (from OP comments): the blog post Curious Regressions states that:
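To illustrate the point about Set 3 and outlier resistance, here is a sketch (not from the post) that approximates a least-absolute-deviations ($\ell_1$) line by iteratively reweighted least squares, one standard way to compute such fits, and compares it to OLS on Anscombe's set 3; the function name `lad_fit` and the IRLS details are my choices:

```python
import numpy as np

# Anscombe set 3: ten near-collinear points plus one outlier at x = 13.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], float)
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

def lad_fit(x, y, iters=50, eps=1e-6):
    """Approximate l1 (least-absolute-deviations) line via IRLS.

    Each pass solves a weighted least-squares problem with weights
    1/|residual|, which drives the solution toward the l1 minimizer."""
    A = np.vstack([x, np.ones_like(x)]).T
    w = np.ones_like(y)
    for _ in range(iters):
        sw = np.sqrt(w)[:, None]
        coef, *_ = np.linalg.lstsq(A * sw, y * sw.ravel(), rcond=None)
        w = 1.0 / np.maximum(np.abs(y - A @ coef), eps)  # guard against /0
    return coef  # (slope, intercept)

ols = np.polyfit(x, y, 1)
lad = lad_fit(x, y)
med_abs = lambda c: np.median(np.abs(y - (c[0] * x + c[1])))
print("OLS:", ols, "LAD:", lad)
```

The $\ell_1$ line tracks the ten collinear points (slope near 0.35), while the OLS line (slope 0.5) is dragged toward the outlier; the median absolute residual of the $\ell_1$ fit is accordingly much smaller.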

Incidentally, I'm told that Frank Anscombe never revealed just how he
came up with these sets of data. If you think it's an easy task to get
all of the summary statistics and the regression results the same,
then give it a try!

In Datasets constructed for a purpose similar to that of Anscombe's quartet, several interesting datasets are given, for instance with the same quantile-based histograms. However, I did not see there a combination of meaningful relationships with matched mixed statistics.

My question is: are there bivariate (or trivariate, to keep visualization) Anscombe-like datasets such that, in addition to having the same $\ell_2$-type statistics:

  • their plots are interpretable as a relationship between $x$ and $y$, as if one were looking for a law between measurements,
  • they possess the same (more robust) $\ell_1$ marginal properties (same median and median of absolute deviation),
  • they have the same bounding boxes: same min, max (and hence $\ell_\infty$-type mid-range and mid-span statistics).

Such datasets would have the same "box-and-whiskers" plot summaries (with min, max, median, median absolute deviation/MAD, mean and std) on each variable, and would still be quite different in interpretation.
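The "box-and-whiskers" summary in question can be collected into a single vector per variable. The helper below is a sketch (the names `summary` and `same_summary` are mine): it gathers min, max, median, MAD, mean and standard deviation, so two datasets qualify exactly when the vectors agree on each variable:

```python
import numpy as np

def summary(v):
    """Mixed l_inf / l1 / l2 marginal summary: (min, max, median, MAD, mean, std)."""
    v = np.asarray(v, float)
    med = np.median(v)
    mad = np.median(np.abs(v - med))  # median absolute deviation about the median
    return np.array([v.min(), v.max(), med, mad, v.mean(), v.std(ddof=1)])

def same_summary(u, v, tol=1e-9):
    """True when two samples share the whole marginal summary vector."""
    return np.allclose(summary(u), summary(v), atol=tol)
```

For instance, `same_summary([1, 2, 3], [3, 2, 1])` is trivially `True`, since permuting a sample never changes its marginal statistics; the interesting cases are non-permutations.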

It would be even more interesting if some least-absolute-deviations regression were the same across the datasets (but maybe I am already asking too much). They could serve as a caveat when talking about robust vs. non-robust regression, and help keep in mind Richard Hamming's quote:

The purpose of computing is insight, not numbers

EDIT (from OP comments) Similar issues are dealt with in Generating Data with Identical Statistics but Dissimilar Graphics, Sangit Chatterjee & Aykut Firat, The American Statistician, 2007, or Cloning data: generating datasets with exactly the same multiple linear regression fit, Australian & New Zealand Journal of Statistics, 2009.

In Chatterjee (2007), the purpose is to generate novel $(x,y)$ pairs with the same means and standard deviations as an initial dataset, while maximizing various "discrepancy/dissimilarity" objective functions. As these functions can be non-convex or non-differentiable, the authors use genetic algorithms (GA). An important step is orthonormalization, which is very consistent with preserving the mean and (unit) variance. The paper's figures (half the paper's content) superimpose input and GA-output data. In my opinion, the GA outputs lose much of the original intuitive interpretation.
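The mean/variance-preserving step can be sketched as a simple affine renormalization (my reading of the idea, not the paper's actual code): standardize the candidate series, then rescale to the target moments:

```python
import numpy as np

def match_mean_std(z, target_mean, target_std):
    """Affinely rescale z so its sample mean and SD match the targets exactly."""
    z = np.asarray(z, float)
    z = (z - z.mean()) / z.std(ddof=1)  # standardize: mean 0, SD 1
    return target_mean + target_std * z

# Any candidate series can be forced to share, e.g., mean 9 and SD 3.32:
cloned = match_mean_std(np.random.default_rng(0).normal(size=11), 9.0, 3.32)
```

Such a rescaling pins the $\ell_2$ moments only: the median and midrange of the result depend on the candidate's shape and are generally not matched to those of the original dataset.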

And technically, neither the median nor the midrange is preserved, and the paper does not mention renormalization procedures that would preserve $\ell_2$, $\ell_1$ and $\ell_\infty$ stats.

Best Answer

To be concrete, I'm considering the problem of creating two datasets, each of which suggests a relationship, but where the two suggested relationships are different, and which nevertheless have approximately the same:

  • mean x
  • mean y
  • SD x
  • SD y
  • median x
  • median y
  • minimum x
  • minimum y
  • maximum x
  • maximum y
  • median absolute deviation from the median of x
  • median absolute deviation from the median of y
  • coefficients from simple linear regression of y on x

Perhaps this is cheating, but one way to make this problem a lot easier is to use a dataset where the best-fitting line is the $x$-axis (zero slope and intercept), $\operatorname{mean} y = \operatorname{median} y = 0$, and $\min y = -\max y$. Then we can just flip the data vertically to get something suggestive of a clearly distinct relationship while all the above statistics are preserved: negating $y$ negates the mean, the median, and the regression coefficients, all of which are 0, and swaps the minimum and maximum, which are symmetric.

Consider, for example,

\begin{array}{ccccccccccc} x & 0 & \tfrac{1}{9} & \tfrac{2}{9} & \tfrac{3}{9} & \tfrac{4}{9} & \tfrac{5}{9} & \tfrac{6}{9} & \tfrac{7}{9} & \tfrac{8}{9} & 1 \\ \hline y & -1 & -\tfrac{1}{2} & 0 & \tfrac{1}{2} & 1 & 1 & \tfrac{1}{2} & 0 & -\tfrac{1}{2} & -1 \end{array}

which has an upward-V-shaped graph like this:

graph

Replace $y$ with $-y$ and you get a downward V with all the same statistics, not just approximately but exactly.
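The claim is easy to verify numerically. This sketch (NumPy assumed; the helper `stats` is mine) compares every statistic in the list above for the V dataset and its vertical flip:

```python
import numpy as np

# The tent dataset from the table: x = 0, 1/9, ..., 8/9, 1.
x = np.arange(10) / 9.0
y = np.array([-1, -0.5, 0, 0.5, 1, 1, 0.5, 0, -0.5, -1])

def stats(x, y):
    """All thirteen statistics from the list above, as one vector."""
    mad = lambda v: np.median(np.abs(v - np.median(v)))
    slope, intercept = np.polyfit(x, y, 1)
    return np.array([x.mean(), y.mean(), x.std(ddof=1), y.std(ddof=1),
                     np.median(x), np.median(y), x.min(), y.min(),
                     x.max(), y.max(), mad(x), mad(y), slope, intercept])

print(np.allclose(stats(x, y), stats(x, -y)))  # prints True: the flip changes nothing
```

Every entry matches because $y$ has mean 0, median 0, symmetric extremes, and a zero-slope, zero-intercept OLS line, exactly the conditions the construction demanded.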