What are the mathematically rigorous data augmentation techniques?

data-augmentation · dataset · mathematical-statistics

Imagine you have a dataset of 1000 observations. To keep things intuitive, imagine they are (x, y) coordinates. Assume the observations are independent, which makes things easier.

You wish you had about a million observations, but you only have 1000. How should you generate a million simulated observations?

Are there any proofs that describe the most mathematically precise way to do this?

You want to be true to your original dataset. How do you do that without adding your own bias?

This is a simple, general problem, but I don't know whether it's trivial. It seems like it should be.

Best Answer

The reason you "wish you had a million observations" is typically that you want to use the data to infer something you don't already know. For example, you might want to fit a model or make predictions. In this context, the data processing inequality implies that, unfortunately, simulating additional data is less helpful than one might hope (though this doesn't mean it's useless).

To be more specific, let $Y$ be a random vector representing unknown quantities we'd like to learn about, and let $X$ be a random vector representing the data. Now, suppose we simulate new data using knowledge learned from the original data. For example, we might fit a probability distribution to the original data and then sample from it. Let $\tilde{X}$ be a random vector representing the simulated data, and $Z = [X, \tilde{X}]$ represent the augmented dataset. Because $Z$ was generated based on $X$, we have that $Z$ and $Y$ are conditionally independent, given $X$. That is:

$$p(x,y,z) = p(x,y) p(z \mid x)$$
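As a concrete (if simplistic) sketch of the "fit a distribution to the original data and then sample from it" step, here is one way it might look in Python. The choice of a Gaussian kernel density estimate, the synthetic input data, and the array shapes are my own illustrative assumptions, not anything prescribed by the argument:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Stand-in for the original dataset: 1000 independent (x, y) observations.
# (Synthetic here purely so the example runs end to end.)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, 0.5], [0.5, 1.0]],
                            size=1000)

# Fit a probability distribution to the original data.  A kernel density
# estimate is just one possible modeling choice; whatever we fit plays the
# role of p(z | x) in the argument above.
kde = gaussian_kde(X.T)          # gaussian_kde expects shape (n_dims, n_points)

# Draw one million simulated observations X_tilde from the fitted model.
X_tilde = kde.resample(size=1_000_000).T

# The augmented dataset Z = [X, X_tilde].  Everything in X_tilde was produced
# from X (plus simulation randomness), which is exactly the conditional
# independence structure p(z | x) used above.
Z = np.vstack([X, X_tilde])
print(Z.shape)                   # (1001000, 2)
```

Nothing in this construction adds information about the quantity of interest; it only re-expresses what was already estimated from the original 1000 points.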

According to the data processing inequality, the mutual information between $Z$ and $Y$ can't exceed that between $X$ and $Y$:

$$I(Z; Y) \le I(X; Y)$$

Since $Z$ contains $X$, this is actually an equality. In any case, this says that no matter how we try to process the data (including using it to simulate new data), it's impossible to gain additional information about our quantity of interest beyond what is already contained in the original data.
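To spell out why equality holds: $X$ can be recovered from $Z = [X, \tilde{X}]$ by a deterministic map (call it $f$, a symbol introduced here only for this step), so the data processing inequality applies in both directions:

$$I(Z; Y) \le I(X; Y) \qquad \text{and} \qquad I(X; Y) = I(f(Z); Y) \le I(Z; Y),$$

and therefore $I(Z; Y) = I(X; Y)$.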

But here's an interesting caveat. Note that the above result holds when $\tilde{X}$ is generated based only on $X$. If $\tilde{X}$ also draws on some external source $S$, then it may be possible to gain additional information about $Y$ (if $S$ carries this information).

Given the above, it's interesting to note that data augmentation can work well in practice. For example, as Haitao Du mentioned, when training an image classifier, randomly transformed copies of the training images are sometimes used (e.g. translations, reflections, and various distortions). This encourages the learning algorithm to find a classifier that's invariant to these transformations, thereby increasing performance. Why does this work? Essentially, we're introducing a useful inductive bias (similar in effect to a Bayesian prior). We know a priori that the true function ought to be invariant, and the augmented images are a way of imposing this knowledge. From another perspective, this a priori knowledge is the additional source $S$ that I mentioned above.
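As a minimal sketch of what such label-preserving transformations can look like in code (the function name, array shapes, and the particular choice of flips plus small shifts are my own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> np.ndarray:
    """Return a randomly transformed copy of a 2-D grayscale image.

    The transformations (horizontal flip, small translation) encode the prior
    belief that the class label is invariant to them; that prior belief is the
    external source of information S referred to above.
    """
    out = image.copy()
    if rng.random() < 0.5:                      # random horizontal reflection
        out = out[:, ::-1]
    dx, dy = rng.integers(-2, 3, size=2)        # small random translation (pixels)
    out = np.roll(out, shift=(int(dx), int(dy)), axis=(0, 1))
    return out

# Example: expand a toy training set of 100 images into 1000 augmented copies.
images = rng.random((100, 28, 28))
augmented = np.stack([augment(img) for img in images for _ in range(10)])
print(augmented.shape)                          # (1000, 28, 28)
```

The key point is that the invariances here are not learned from the data; they are supplied by us, which is what makes them an external source of information in the sense of $S$ above.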