Interpreting the difference between lognormal and power law distribution (network degree distribution)

Tags: curve fitting, lognormal distribution, networks, power law

First off, I'm not a statistician. However, I have been doing statistical network analysis for my PhD.

As part of the network analysis, I plotted a Complementary Cumulative Distribution Function (CCDF) of network degrees. What I found was that, unlike conventional network distributions (e.g. WWW), the distribution is best fitted by a lognormal distribution. I also tried fitting a power law using Clauset et al.'s Matlab scripts, and found that the tail of the curve follows a power law with a cut-off.

[Figure: CCDF of the network degrees with the fitted curves]

Dotted line represents power law fit.
Purple line represents log-normal fit.
Green line represents exponential fit.

What I'm struggling to understand is what this all means. I've read this paper by Newman, which touches on the topic:
http://arxiv.org/abs/cond-mat/0412004

Below is my wild guess:

If the degree distribution follows a power law, I understand this to mean that links attach to nodes by linear preferential attachment with respect to degree (the rich-get-richer effect, or Yule process).

Am I right in saying that, with the lognormal distribution I'm seeing, there is sublinear preferential attachment at the beginning of the curve, which becomes more linear towards the tail, where the curve can be fitted by a power law?

Also, since a log-normal distribution occurs when the logarithm of the random variable (say X) is normally distributed, does this mean that a log-normally distributed X has more small values and fewer large values than a random variable that follows a power law distribution would have?

More importantly, with regards to network degree distribution, does a log-normal preferential attachment still suggest a scale-free network? My instinct tells me that since the tail of the curve can be fitted by a power law, the network can still be concluded as exhibiting scale-free characteristics.

Best Answer

I think it will be helpful to separate the question into two parts:

  1. What is the functional form of your empirical distribution? and
  2. What does that functional form imply about the generating process in your network?

The first question is a statistics question. If you've applied the methods of Clauset et al. for fitting the power-law distribution and those methods gave you a $p>0.1$ for the upper-tail fit, then you're allowed to say that the upper tail (looking at your figure, this is $x\geq15$ or so) is plausibly power-law distributed. If the methods gave you $p<0.1$ then you can't say that, even if the fit looks good to the eye. Deciding whether the log-normal fit is better means basically doing the same thing. Can you reject that model as a generating process for the degree distribution data you have? If not, then you're allowed to put the log-normal into the "plausible" category.
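To make the $p$-value in that paragraph concrete, here is a minimal sketch of the goodness-of-fit idea, not a replacement for the Clauset et al. scripts. It assumes a fixed, user-supplied `xmin` and uses the continuous approximation for the exponent, whereas the full method also estimates `xmin` and treats integer data with a discrete model. The function names (`fit_alpha`, `ks_distance`, `tail_p_value`) are my own, chosen for illustration.

```python
import numpy as np

def fit_alpha(tail, xmin):
    """Continuous-approximation MLE for alpha in p(x) ~ x^(-alpha), x >= xmin."""
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    x = np.sort(tail)
    n = len(x)
    model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
    ecdf_hi = np.arange(1, n + 1) / n   # empirical CDF just after each point
    ecdf_lo = np.arange(0, n) / n       # empirical CDF just before each point
    return max(np.max(np.abs(ecdf_hi - model_cdf)),
               np.max(np.abs(model_cdf - ecdf_lo)))

def tail_p_value(data, xmin, n_boot=200, seed=0):
    """Parametric-bootstrap p-value for the power-law hypothesis on x >= xmin.

    Simplification (an assumption of this sketch): xmin is held fixed and
    only the tail is resampled.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    tail = data[data >= xmin]
    alpha = fit_alpha(tail, xmin)
    d_obs = ks_distance(tail, xmin, alpha)

    exceed = 0
    for _ in range(n_boot):
        # Inverse-transform sampling from the fitted power law.
        u = rng.random(len(tail))
        synth = xmin * (1.0 - u) ** (-1.0 / (alpha - 1.0))
        # Refit each synthetic sample before measuring its distance.
        d_synth = ks_distance(synth, xmin, fit_alpha(synth, xmin))
        if d_synth >= d_obs:
            exceed += 1
    return alpha, d_obs, exceed / n_boot
```

Calling `tail_p_value(degrees, xmin=15)` returns the fitted exponent, the observed KS distance and the fraction of synthetic power-law samples that fit worse than your data. A small fraction (below 0.1 in the convention used above) means the power law is not a plausible description of that tail.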

As a small technical point, degrees are integer quantities, while a log-normal distribution requires a continuous variable, so the two are not really compatible (unless you are only talking about $x\gg1$, where the difference between integers and real values becomes negligible for these kinds of questions). To do the statistics properly, you'd want to write down the probability mass function for a "log-normally" distributed integer quantity, derive estimators for it and apply those to your data.
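One way to write down such a distribution, offered as a sketch rather than the canonical definition, is to assign each integer $k$ the probability mass that the continuous log-normal places on the interval $[k-0.5, k+0.5]$, renormalized over $k \geq k_{\min}$. This rounding construction is my assumption, and other discretizations are possible. The names `discrete_lognormal_pmf` and `log_likelihood` are hypothetical.

```python
import math

def _lognormal_sf(x, mu, sigma):
    """Survival function P(X > x) of a continuous log-normal."""
    if x <= 0:
        return 1.0
    z = (math.log(x) - mu) / sigma
    # erfc keeps precision in the upper tail, where 1 - CDF would cancel.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def discrete_lognormal_pmf(k, mu, sigma, kmin=1):
    """P(K = k) for integer k >= kmin under the rounding construction."""
    if k < kmin:
        return 0.0
    mass = _lognormal_sf(k - 0.5, mu, sigma) - _lognormal_sf(k + 0.5, mu, sigma)
    norm = _lognormal_sf(kmin - 0.5, mu, sigma)
    return mass / norm

def log_likelihood(degrees, mu, sigma, kmin=1):
    """Log-likelihood of the degrees >= kmin; maximize over (mu, sigma) to fit."""
    return sum(math.log(discrete_lognormal_pmf(k, mu, sigma, kmin))
               for k in degrees if k >= kmin)
```

The estimators are then the values of `mu` and `sigma` that maximize `log_likelihood` for your degree sequence, found with a grid search or any numerical optimizer. Parameter values far from the data can make a mass underflow to zero, so a practical fit should restrict the search to a sensible range.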

The second question is actually the harder of the two. As some people pointed out in the comments above, there are many mechanisms that produce power-law distributions, and preferential attachment (in all its variations and glory) is just one of them. Thus, observing a power-law distribution in your data (even a genuine one that passes the necessary statistical tests) is not sufficient evidence to conclude that the generating process was preferential attachment. More generally, suppose you have a mechanism A that produces some pattern X in data (e.g., a log-normal degree distribution in your network). Observing pattern X in your data is not evidence that your data were produced by mechanism A. The data are consistent with A, but that doesn't mean A is the right mechanism.

To really show that A is the answer, you have to test its mechanistic assumptions directly and show that they also hold for your system, and preferably also show that other predictions of the mechanism also hold in the data. A really great example of the assumption-testing part was done by Sid Redner (see Figure 4 of this paper), in which he showed that for citation networks, the linear preferential attachment assumption actually holds in the data.
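As a rough illustration of what testing the attachment assumption directly can look like, the sketch below estimates an attachment rate per degree from a time-ordered edge list. It is not necessarily the procedure used in the paper mentioned above. It assumes that every event adds one brand-new node with a single link to an existing node, and the names `attachment_kernel`, `events` and `initial_degrees` are hypothetical.

```python
from collections import Counter

def attachment_kernel(events, initial_degrees):
    """Estimate the per-node attachment rate as a function of degree.

    events: time-ordered (new_node, target) pairs; each new_node is assumed
            to be brand new and to bring exactly one link.
    initial_degrees: dict mapping each seed node to its starting degree.
    Returns {k: links received by degree-k nodes / times a degree-k node
             was available to receive one}.
    """
    degree = dict(initial_degrees)
    nodes_at = Counter(degree.values())   # degree -> number of nodes
    hits = Counter()                      # links received at degree k
    exposure = Counter()                  # opportunities at degree k

    for new_node, target in events:
        # Every existing node was a candidate for this link.
        for k, count in nodes_at.items():
            exposure[k] += count

        k_target = degree[target]
        hits[k_target] += 1

        # Move the target up one degree class.
        nodes_at[k_target] -= 1
        if nodes_at[k_target] == 0:
            del nodes_at[k_target]
        degree[target] = k_target + 1
        nodes_at[k_target + 1] += 1

        # The new node enters with degree 1.
        degree[new_node] = 1
        nodes_at[1] += 1

    return {k: hits[k] / exposure[k] for k in exposure}
```

If the estimated rate grows roughly in proportion to $k$, the data are consistent with linear preferential attachment. A rate that is flat in $k$ points to uniform attachment, and one that grows more slowly than $k$ points to a sublinear kernel. Rates at high degrees rest on few events, so they are noisy and benefit from binning.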

Finally, the term "scale-free network" is overloaded in the literature, so I would strongly suggest avoiding it. People use it to refer to networks with power-law degree distributions and to networks grown by (linear) preferential attachment. But as we just explained, these two things are not the same, so using a single term to refer to both is just confusing. In your case, a log-normal distribution is completely inconsistent with the classic linear preferential attachment mechanism, so if you decide that log-normal is the answer to question 1 (in my answer), then it would imply that your network is not 'scale free' in that sense. The fact that the upper tail is 'okay' as a power-law distribution would be meaningless in that case, since there is always some portion of the upper tail of any empirical distribution that will pass that test (and it will pass because the test loses power when there isn't much data to go on, which is exactly what happens in the extreme upper tail).