Graph Theory – Does Big Data Have a Ramsey Theory Problem?

Tags: applications, graph-theory, ramsey-theory, statistical-inference

I'm erring on the side of conservatism by asking here rather than on MO, as it is possible this is a complex question.

"Big Data" is the Silicon Valley term for the issues surrounding the huge amounts of data being produced by the global IT structure. Advanced mathematics is starting to pay attention to this, with very early thoughts on topological approaches. For example, see the Wiki here.

But one obvious way to think about patterns in Big Data is as edge-colored complete graphs: let the vertices be your data points, let the edges represent relations between pairs of data points, and let the colors encode specific relations (which are the objective of Big Data visualization), with some neutral color representing "no relation".
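
As a concrete, toy version of that encoding, here is a minimal Python sketch; the `relation` function is purely hypothetical and stands in for whatever association measure a real pipeline would compute for each pair of data points.

```python
from itertools import combinations

# Hypothetical relation function: given two data points, return the name of the
# relation between them, or "none" (the neutral color) if there is no relation.
def relation(a, b):
    return "same_parity" if (a - b) % 2 == 0 else "none"

data = list(range(8))  # toy data set: one vertex per data point

# Edge-colored complete graph: every unordered pair of vertices gets exactly one color.
coloring = {(a, b): relation(a, b) for a, b in combinations(data, 2)}

print(coloring[(0, 2)])  # same_parity
print(coloring[(0, 1)])  # none
```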

Given the sheer size of Big Data, Ramsey theory then virtually guarantees that such a coloring contains monochromatic complete subgraphs, which may be nothing but spurious relations that must exist for purely combinatorial reasons.
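
For reference, the relevant statement is the finite, multicolor Ramsey theorem: for every number of colors $r$ and clique size $k$ there is a threshold $R_r(k)$ such that

$$n \ge R_r(k) \;\Longrightarrow\; \text{every } r\text{-coloring of the edges of } K_n \text{ contains a monochromatic } K_k.$$

The smallest nontrivial case is $R_2(3) = R(3,3) = 6$: color the pairs among any six vertices with two colors and there is always a monochromatic triangle.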

I am NOT a graph theorist in any way. So what I am asking specifically is this:

Are there other techniques that can be overlaid on a graph-theoretical approach to help decide whether the monochromatic subgraphs are real "signal" and not Ramsey noise? Or am I misunderstanding, and the Ramsey structures are not actually noise?

Best Answer

As requested, I'll post the comment above as an answer:

The OP is right that there are inevitably patterns in large data sets, and in fact often ones of exactly the sort that we want to find. Here are a couple of very common examples.

In statistics, traditionally you do "hypothesis testing", where you try to find evidence for or against the hypothesis that a parameter has a particular value, for example that the mean height of American males is 5'11''. You do this by measuring the mean height of a sample and then seeing whether it is "significantly" different from 5'11''. The problem is that, if your sample is big enough, the sample mean is almost always "significantly" different from 5'11'', because significance, whatever that is, increases with the size of the sample.
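
To see this numerically, here is a quick simulation; the "true" mean of 71.05 inches and the standard deviation of 3 inches are made-up numbers, chosen so that the deviation from the hypothesized 5'11'' (71 inches) is practically meaningless.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothesized mean height: 5'11'' = 71 inches.  The "true" mean is only 0.05'' away.
h0_mean, true_mean, sd = 71.0, 71.05, 3.0

for n in (100, 10_000, 1_000_000):
    sample = rng.normal(true_mean, sd, size=n)
    t_stat, p_value = stats.ttest_1samp(sample, h0_mean)
    print(f"n = {n:>9,}:  p-value = {p_value:.4f}")

# As n grows, the p-value collapses towards zero even though the underlying
# difference (0.05 inches) never changes.
```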

Another example is finance, where people called technical analysts look for support and resistance patterns. A resistance pattern, for instance, is where a price keeps falling back after reaching a certain level (say \$20) and then climbing toward it again; this is taken as evidence that people sell and take their profits whenever the price reaches \$20. However, such patterns also very commonly appear in pure random walks, so it is often not clear whether they represent anything real about the financial situation.
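
As an illustration, a pure random walk with no structure at all can produce the same "falls back from a level" behavior that an analyst might read as resistance. The sketch below is a toy simulation: the \$20 starting price, the step size, the 0.5 tolerance, and the "three touches" threshold are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_rejections(prices, level, tol=0.5):
    """Count times the path climbs to within `tol` of `level` and then falls back."""
    rejections = 0
    near = False
    for p in prices:
        if abs(p - level) <= tol:
            near = True
        elif near and p < level - tol:
            rejections += 1      # touched the level, then dropped away
            near = False
        elif p > level + tol:
            near = False         # broke through the level instead
    return rejections

n_walks, n_steps = 1000, 500
looks_like_resistance = 0
for _ in range(n_walks):
    walk = 20 + np.cumsum(rng.normal(0.0, 0.3, n_steps))  # pure noise, no structure
    level = walk[:100].max()     # treat an early high as a candidate "resistance" level
    if count_rejections(walk[100:], level) >= 3:
        looks_like_resistance += 1

print(f"{looks_like_resistance} of {n_walks} pure random walks bounce off an early high 3+ times")
```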

It is said that "if you torture the data for long enough, it will confess." Some people use the term Data Mining in a pejorative sense to refer to finding patterns which aren't really there, in the sense that all sufficiently large data sets will contain such patterns just by chance.

One of our main defences against this problem is to split the data randomly into subsets, look for patterns in one subset, and then see whether they generalize to the other subsets. In machine learning, people say "training" instead of "fitting a statistical model" or "looking for patterns", and then talk about "validation". The idea is that validation shows you how your model is likely to perform on unseen data. If your model has learned a spurious pattern, you hope that this will show up as poor performance on the validation data. Such a model is said to be overfitted.
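
Here is a minimal sketch of the idea, fitting polynomials to pure noise so that any "pattern" the model finds is spurious by construction; the sample size, the split, and the polynomial degrees are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pure noise: there is no real relationship between x and y.
x = rng.uniform(0, 1, 60)
y = rng.normal(0, 1, 60)

# Hold out a third of the data as a validation set.
idx = rng.permutation(60)
train, valid = idx[:40], idx[40:]

for degree in (1, 15):
    coeffs = np.polyfit(x[train], y[train], degree)   # "training" = fitting the model
    mse_train = np.mean((y[train] - np.polyval(coeffs, x[train])) ** 2)
    mse_valid = np.mean((y[valid] - np.polyval(coeffs, x[valid])) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.2f}, validation MSE {mse_valid:.2f}")

# The flexible model fits the training noise better, but that improvement does not
# carry over to the validation set: the signature of overfitting.
```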

Splitting the whole data set into $k$ subsets, fitting the model on all but one of the subsets and validating it on the one held out, and doing this $k$ times, once for each subset, is called $k$-fold cross-validation and is a common way of estimating how your model will perform on new data. This is why the Stack Exchange statistics site is called Cross Validated.
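
Here is a generic sketch of the procedure; the data-generating process and the candidate model degrees are invented for illustration, and in practice you would normally reach for a library routine (e.g. scikit-learn's cross-validation helpers) rather than writing this by hand.

```python
import numpy as np

def k_fold_cv(x, y, k, fit, predict, seed=0):
    """Generic k-fold cross-validation: fit on k-1 folds, score on the held-out fold."""
    indices = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train_idx], y[train_idx])
        predictions = predict(model, x[valid_idx])
        scores.append(np.mean((y[valid_idx] - predictions) ** 2))  # validation MSE
    return float(np.mean(scores))

# Toy data: a genuine linear trend plus noise.
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)
y = 2 * x + rng.normal(0, 0.5, 100)

# Compare a straight line with an over-flexible polynomial.
for degree in (1, 12):
    mse = k_fold_cv(
        x, y, k=5,
        fit=lambda xs, ys, d=degree: np.polyfit(xs, ys, d),
        predict=np.polyval,
    )
    print(f"degree {degree:2d}: 5-fold cross-validated MSE = {mse:.3f}")
```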

The motivation for this procedure is that we really want to see how our patterns generalize to unseen data. If they are real patterns, they should show up in unseen data as well. But if we don't have any unseen data yet, we just pretend that some of the data we do have is unseen, and use it to validate the model. This makes perfect sense, because usually our reason for wanting to fit models/find patterns/learn is to make predictions about new data that we haven't seen yet.
