Graph Theory – Does Big Data Have a Ramsey Theory Problem?

Tags: applications, graph-theory, ramsey-theory, statistical-inference

I'm erring on the side of conservatism by asking here rather than on MO, as it is possible this is a complex question.

"Big Data" is the Silicon Valley term for the issues surrounding the huge amounts of data being produced by the global IT structure. Advanced mathematics is starting to pay attention to this, with very early thoughts on topological approaches. For example, see the Wiki here.

But one obvious way to think about patterns in Big Data is as edge-colored complete graphs: let the vertices be your data points, let the edges represent relations between pairs of data points, and let the colors encode specific relations (which are the objective of Big Data visualization), with some neutral color representing "no relation".
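
As a concrete, toy version of that encoding, here is a minimal Python sketch; the `relation` function is purely hypothetical and stands in for whatever association measure a real pipeline would compute for each pair of data points.

```python
from itertools import combinations

# Hypothetical relation function: given two data points, return the name of the
# relation between them, or "none" (the neutral color) if there is no relation.
def relation(a, b):
    return "same_parity" if (a - b) % 2 == 0 else "none"

data = list(range(8))  # toy data set: one vertex per data point

# Edge-colored complete graph: every unordered pair of vertices gets exactly one color.
coloring = {(a, b): relation(a, b) for a, b in combinations(data, 2)}

print(coloring[(0, 2)])  # same_parity
print(coloring[(0, 1)])  # none
```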

Given the sheer size of Big Data, Ramsey theory then virtually guarantees that such a coloring contains monochromatic complete subgraphs, which may be nothing but spurious relations that must exist for purely combinatorial reasons.
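
For reference, the relevant statement is the finite, multicolor Ramsey theorem: for every number of colors $r$ and clique size $k$ there is a threshold $R_r(k)$ such that

$$n \ge R_r(k) \;\Longrightarrow\; \text{every } r\text{-coloring of the edges of } K_n \text{ contains a monochromatic } K_k.$$

The smallest nontrivial case is $R_2(3) = R(3,3) = 6$: color the pairs among any six vertices with two colors and there is always a monochromatic triangle.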

I am NOT a graph theorist in any way. So what I am asking specifically is this:

Are there other techniques that can be overlaid on a graph-theoretical approach to help decide whether the monochromatic subgraphs are real "signal" and not Ramsey noise? Or am I misunderstanding, and the Ramsey structures are not actually noise?

Best Answer

As requested, I'll post the comment above as an answer:

The OP is right that there are inevitably patterns in large data sets, and in fact often ones of exactly the sort that we want to find. Here are a couple of very common examples.

In statistics, traditionally you do "hypothesis testing", where you try to find evidence for or against the hypothesis that a parameter has a particular value, for example that the mean height of American males is 5'11''. You do this by measuring the mean height of a sample and then seeing whether it is "significantly" different from 5'11''. The problem is that, if your sample is big enough, the sample mean is almost always "significantly" different from 5'11'', because significance, whatever that is, increases with the size of the sample.
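
To see this numerically, here is a quick simulation; the "true" mean of 71.05 inches and the standard deviation of 3 inches are made-up numbers, chosen so that the deviation from the hypothesized 5'11'' (71 inches) is practically meaningless.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothesized mean height: 5'11'' = 71 inches.  The "true" mean is only 0.05'' away.
h0_mean, true_mean, sd = 71.0, 71.05, 3.0

for n in (100, 10_000, 1_000_000):
    sample = rng.normal(true_mean, sd, size=n)
    t_stat, p_value = stats.ttest_1samp(sample, h0_mean)
    print(f"n = {n:>9,}:  p-value = {p_value:.4f}")

# As n grows, the p-value collapses towards zero even though the underlying
# difference (0.05 inches) never changes.
```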

Another example is finance, where people called technical analysts look for support and resistance patterns. A resistance pattern, for instance, is where a price keeps falling back after reaching a certain level (say \$20) and then climbing toward it again; this is taken as evidence that people sell and take their profits whenever the price reaches \$20. However, such patterns also very commonly appear in pure random walks, so it is often not clear whether they represent anything real about the financial situation.
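
As an illustration, a pure random walk with no structure at all can produce the same "falls back from a level" behavior that an analyst might read as resistance. The sketch below is a toy simulation: the \$20 starting price, the step size, the 0.5 tolerance, and the "three touches" threshold are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_rejections(prices, level, tol=0.5):
    """Count times the path climbs to within `tol` of `level` and then falls back."""
    rejections = 0
    near = False
    for p in prices:
        if abs(p - level) <= tol:
            near = True
        elif near and p < level - tol:
            rejections += 1      # touched the level, then dropped away
            near = False
        elif p > level + tol:
            near = False         # broke through the level instead
    return rejections

n_walks, n_steps = 1000, 500
looks_like_resistance = 0
for _ in range(n_walks):
    walk = 20 + np.cumsum(rng.normal(0.0, 0.3, n_steps))  # pure noise, no structure
    level = walk[:100].max()     # treat an early high as a candidate "resistance" level
    if count_rejections(walk[100:], level) >= 3:
        looks_like_resistance += 1

print(f"{looks_like_resistance} of {n_walks} pure random walks bounce off an early high 3+ times")
```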

It is said that "if you torture the data for long enough, it will confess." Some people use the term Data Mining in a pejorative sense to refer to finding patterns which aren't really there, in the sense that all sufficiently large data sets will contain such patterns just by chance.

One of our main defences against this problem is to split the data randomly into subsets, look for patterns in one subset, and then see whether they generalize to the other subsets. In machine learning, people say "training" instead of "fitting a statistical model" or "looking for patterns", and then talk about "validation". The idea is that validation shows you how your model is likely to perform on unseen data. If your model has learned a spurious pattern, you hope that this will show up as poor performance on the validation data. Such a model is said to be overfitted.
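
Here is a minimal sketch of the idea, fitting polynomials to pure noise so that any "pattern" the model finds is spurious by construction; the sample size, the split, and the polynomial degrees are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pure noise: there is no real relationship between x and y.
x = rng.uniform(0, 1, 60)
y = rng.normal(0, 1, 60)

# Hold out a third of the data as a validation set.
idx = rng.permutation(60)
train, valid = idx[:40], idx[40:]

for degree in (1, 15):
    coeffs = np.polyfit(x[train], y[train], degree)   # "training" = fitting the model
    mse_train = np.mean((y[train] - np.polyval(coeffs, x[train])) ** 2)
    mse_valid = np.mean((y[valid] - np.polyval(coeffs, x[valid])) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.2f}, validation MSE {mse_valid:.2f}")

# The flexible model fits the training noise better, but that improvement does not
# carry over to the validation set: the signature of overfitting.
```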

Splitting the whole data set into $k$ subsets, fitting the model on all but one of the subsets and validating it on the one held out, and doing this $k$ times, once for each subset, is called $k$-fold cross-validation and is a common way of estimating how your model will perform on new data. This is why the Stack Exchange statistics site is called Cross Validated.
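
Here is a generic sketch of the procedure; the data-generating process and the candidate model degrees are invented for illustration, and in practice you would normally reach for a library routine (e.g. scikit-learn's cross-validation helpers) rather than writing this by hand.

```python
import numpy as np

def k_fold_cv(x, y, k, fit, predict, seed=0):
    """Generic k-fold cross-validation: fit on k-1 folds, score on the held-out fold."""
    indices = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train_idx], y[train_idx])
        predictions = predict(model, x[valid_idx])
        scores.append(np.mean((y[valid_idx] - predictions) ** 2))  # validation MSE
    return float(np.mean(scores))

# Toy data: a genuine linear trend plus noise.
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)
y = 2 * x + rng.normal(0, 0.5, 100)

# Compare a straight line with an over-flexible polynomial.
for degree in (1, 12):
    mse = k_fold_cv(
        x, y, k=5,
        fit=lambda xs, ys, d=degree: np.polyfit(xs, ys, d),
        predict=np.polyval,
    )
    print(f"degree {degree:2d}: 5-fold cross-validated MSE = {mse:.3f}")
```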

The motivation for this procedure is that we really want to see how our patterns generalize to unseen data. If they are real patterns, they should show up in unseen data as well. But if we don't have any unseen data yet, we just pretend that some of the data we do have is unseen, and use it to validate the model. This makes perfect sense, because usually our reason for wanting to fit models/find patterns/learn is to make predictions about new data that we haven't seen yet.
