[Math] On Mathematical Analysis of MathSciNet & MathOverflow

applied-mathematics, journals, pr.probability, reference-request, st.statistics

This question has two original motivations: mathematical and social.

The mathematical motivation is mainly based on what I have seen about Zipf's law here and there. Zipf's law states that a Zipfian distribution (a variant of the power-law probability distribution) provides a good approximation of many types of data arising from physical or social phenomena.

The iconic example is the frequency of words in a natural language, where Zipf's law indicates that the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
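
As a rough illustration of the rank-frequency relation just described, here is a minimal Python sketch. It assumes the idealized form in which the frequency of the item of rank r is proportional to 1/r^s, with s = 1 giving the "twice as often, three times as often" pattern. The function names and the sample counts are made up for illustration, and the log-log least-squares fit is only a crude way to estimate the exponent, not the method of any paper mentioned here.

```python
import math


def zipf_expected(n_ranks, s=1.0):
    """Return normalized Zipf probabilities p(r) ~ 1 / r**s for r = 1..n_ranks."""
    weights = [1.0 / r ** s for r in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def fit_zipf_exponent(counts):
    """Estimate the exponent s from raw counts.

    Sorts the positive counts in decreasing order, then fits a straight line
    to log(count) versus log(rank) by ordinary least squares. Under an ideal
    Zipf law the slope of that line is -s. Needs at least two positive counts.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


if __name__ == "__main__":
    # Hypothetical word counts, chosen to follow count = 1200 / rank exactly.
    sample_counts = [1200, 600, 400, 300, 240]
    probs = zipf_expected(len(sample_counts))
    print("ratio of rank 1 to rank 2:", probs[0] / probs[1])
    print("estimated exponent:", fit_zipf_exponent(sample_counts))
```

Applied to real data (say, publication counts per author), an estimated exponent alone does not establish a power law; it only gives a first look at whether the log-log plot is roughly linear.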

There are plenty of other surprising appearances of Zipf's law (and of power laws) in various real-world situations. For instance, Silagadze, in his paper "Citations and the Zipf-Mandelbrot's law", shows that the number of citations of scientific papers obeys Zipf's law!

  • These made me wonder whether Zipf/power-law distributions also appear in large mathematical research databases such as arXiv, MathSciNet, or MathOverflow, say in the number of publications, co-authors, reviews, reputation points, etc.

The social motivation, however, comes from occasional general claims by some mathematicians about how the mathematical community was, is, or will be. Statements such as the following:

  • The mathematical paradigm is slowly shifting from pure mathematics to applied mathematics as more and more mathematicians do research of an applied nature.

  • In comparison with other branches of logic, model theory has the strongest ties with other mathematical disciplines.

  • Due to the urgent need and rapid advancement in AI and computer science, computability and complexity are the most rapidly expanding branches of mathematical logic.

  • The most influential work of a mathematician usually happens before age 40.

Such controversial claims often provoke intense arguments among mathematicians. The question is how to settle them once and for all. Indeed, a rigorous approach to verifying any such social claim is to analyze large mathematical databases such as arXiv, MathSciNet, or MathOverflow mathematically in order to extract general patterns, including the distribution of research topics, possible changes in mathematical research fashion, and the interaction and intersection of various mathematical disciplines with each other.

These motivations lead to the following general question:

Question. Have large mathematical databases such as arXiv, MathSciNet, or MathOverflow been the subject of any social network and database analysis so far? What are examples of published mathematical research about possible mathematical patterns that may exist in them as databases? What patterns have been found? (I am particularly interested in the case of Zipf's law and other naturally occurring statistical patterns such as Benford's law.)

It is likely that some journal ranking organizations have already conducted some research along these lines, but I haven't seen any outline so far. It would also be interesting if the research has been done in a comparative way that allows one to compare the general characteristics of the math community with its counterparts, say physics or biology.


Update. A fellow MathOverflow user just emailed me expressing his interest in conducting some statistical research on the MathSciNet and MathOverflow databases. He asked whether I know how to get access to the corresponding background data, which I actually don't. I am not even sure whether it is publicly accessible or free (particularly in the case of MathSciNet and arXiv). I also suspect that there must be some non-disclosure rules and restrictions which any research along these lines should follow. It would be nice if somebody who knows how to get access to the raw material needed for research on these databases could shed some light on these issues.

Best Answer

MathOverflow has been studied as a "complex network" in Social achievement and centrality in MathOverflow, by L.V. Montoya, A. Ma, and R.J. Mondragón.
The analysis distinguishes degree centrality (based on the number of edges that a node has), betweenness centrality (which measures the fraction of geodesic paths that pass through a node), closeness centrality (the mean geodesic distance from a node to every other node), and eigenvector centrality (which measures how well connected a node is and how much direct influence it may have over other well-connected nodes in the network). Three hypotheses are tested (the first two pass, the third fails):

  1. A user’s reputation score is closely related to their degree centrality.
  2. The total number of views obtained by a user is related to their eigenvector centrality.
  3. The number of upvotes obtained by a user is related to their closeness centrality.
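
Two of the centrality measures described above can be sketched in plain Python. This is only an illustration on a hypothetical toy graph: it assumes an undirected, connected network stored as a dictionary of neighbor sets, and how edges would actually be built from MathOverflow data (for example, linking an asker to an answerer) is an assumption here, not something taken from the paper. Closeness is computed as the mean geodesic distance, following the description above; many libraries report its reciprocal instead.

```python
from collections import deque

# Hypothetical toy network: one well-connected user and three others.
TOY_GRAPH = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
}


def degree_centrality(graph):
    """Number of edges at each node, divided by the maximum possible (n - 1)."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def bfs_distances(graph, source):
    """Geodesic (shortest-path) distances from source, by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def mean_geodesic_distance(graph):
    """Mean geodesic distance from each node to every other node.

    Assumes the graph is connected, so every other node is reachable.
    """
    result = {}
    for node in graph:
        dist = bfs_distances(graph, node)
        others = [d for target, d in dist.items() if target != node]
        result[node] = sum(others) / len(others)
    return result


if __name__ == "__main__":
    print(degree_centrality(TOY_GRAPH))
    print(mean_geodesic_distance(TOY_GRAPH))
```

Testing a hypothesis like the first one would then amount to correlating such a centrality score with each user's reputation; that step needs the real user data and is not sketched here.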

MathSciNet has been used by Jerrold W. Grossman to analyze the network of collaborations among mathematicians in Patterns of Collaboration in Mathematical Research: Apparently, the appropriate popular buzz phrase for mathematicians should be “eight degrees of separation”.
See also Patterns of Research in Mathematics by the same author.