Solved – How to Calculate a Z-Score from Power Log Distributions

internet, normal distribution, power law, standard deviation, t-test

Please forgive me if the question is nonsensical, as I only have an elementary knowledge of statistics and have no idea where to start.

I want to perform an analysis of some web analytics information, and determine which of our partners we should be building stronger relationships with because they are sending us highly engaged users.

My first thought was to calculate the z-score for each of the main metrics in the data set:

PageViews
Visits
Visitors
Repeat Visits
etc.

And then for partner sites that have a positive Z-Score for:

Repeat Visits
Pages / Visit
Avg. Time on Page

with a negative Z-Score for Bounce Rates, those were the sites that would be more desirable to improve our relationships with.

Additionally, we would focus on the partner sites that met those criteria and also had a negative Z-Score for PageViews or Visits. My assumption is that these sites are the ones driving highly engaged users, but at a very low volume.

However, I'm not sure if this assumption is correct. The Z-Score is based on the standard deviation, which from what I understand is really meant for normal distributions, so it doesn't seem like the right path given that PageViews, Visits, Visitors, and sometimes Pages/Visit fit a power law distribution better than a normal distribution.

Best Answer

It doesn't matter so much that the Z score is often compared to a symmetrical normal distribution. The key thing about your proposed approach is that it will give you a positive value when the partner has an "above mean" (common sense term = "above average") number of repeat visits, pages per visit, or time per page. So long as you are aware that this is what it is doing, it's not necessarily a bad approach.

You might want to consider alternative cut-off points. For example, the median of each of these variables is likely to be lower than the mean; if you used the median as the cut-off point instead, you would get the best half of partners against each criterion. However, any cut-off point is arbitrary, and its usefulness depends on whether the results are practicable (i.e. does it give you a reasonable number of partners to work with).

So the short answer to your question is - there is no problem with using these Z scores even when the underlying distribution is skewed. Just be aware that a positive Z score means that the partner has a value higher than the mean for that variable, nothing else. And the mean is susceptible to outliers, i.e. a single partner with a squillion repeat visits can push the mean so high that only that partner makes your list. So watch out for that problem and consider using another cut-off (median, or 75th percentile) instead. Ultimately, the answer depends on your business drivers.
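The outlier problem described above can be seen in a small sketch. The partner names and counts are made up for illustration, and only the Python standard library is assumed:

```python
import statistics

# Hypothetical repeat-visit counts per partner; "partner_h" is a deliberate outlier.
repeat_visits = {
    "partner_a": 12, "partner_b": 15, "partner_c": 9, "partner_d": 20,
    "partner_e": 14, "partner_f": 11, "partner_g": 18, "partner_h": 5000,
}

values = list(repeat_visits.values())
mean = statistics.mean(values)
sd = statistics.pstdev(values)              # population standard deviation
median = statistics.median(values)
q3 = statistics.quantiles(values, n=4)[2]   # 75th percentile

# Z score: distance from the mean in units of standard deviation.
z_scores = {name: (v - mean) / sd for name, v in repeat_visits.items()}

above_mean = [name for name, z in z_scores.items() if z > 0]
above_median = [name for name, v in repeat_visits.items() if v > median]
above_q3 = [name for name, v in repeat_visits.items() if v > q3]

print("positive Z score:", above_mean)
print("above median:    ", above_median)
print("above 75th pct:  ", above_q3)
```

With these made-up numbers the single extreme partner drags the mean far above everyone else, so it is the only one with a positive Z score, while the median cut-off still returns half of the partners.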

The next step up in analytical techniques is to find a single criterion against which to rank partners, which somehow takes into account all three of the variables you are interested in. A common naive way to go about this is to take averages of standardised scores; more sophisticated alternatives are to use principal components analysis or factor analysis. But this takes us away from the actual question.
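As a rough sketch of the "average of standardised scores" idea, the snippet below combines several metrics into one ranking. The data, the metric names, and the choice to flip the sign of the bounce rate (so that lower is better) are all assumptions for illustration, not part of the original question's data:

```python
import statistics

# Hypothetical engagement metrics per partner (illustrative numbers only).
partners = {
    "partner_a": {"repeat_visits": 120, "pages_per_visit": 3.2, "time_on_page": 45.0, "bounce_rate": 0.55},
    "partner_b": {"repeat_visits": 40,  "pages_per_visit": 1.8, "time_on_page": 20.0, "bounce_rate": 0.80},
    "partner_c": {"repeat_visits": 300, "pages_per_visit": 5.1, "time_on_page": 90.0, "bounce_rate": 0.30},
    "partner_d": {"repeat_visits": 90,  "pages_per_visit": 2.5, "time_on_page": 40.0, "bounce_rate": 0.60},
}

# +1 where a higher value is better, -1 where a lower value is better.
direction = {"repeat_visits": 1, "pages_per_visit": 1, "time_on_page": 1, "bounce_rate": -1}


def standardise(values):
    """Return the Z scores of a list of numbers."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


names = list(partners)
composite = {name: 0.0 for name in names}
for metric, sign in direction.items():
    z = standardise([partners[n][metric] for n in names])
    for name, z_value in zip(names, z):
        # Equal-weight average of the signed Z scores.
        composite[name] += sign * z_value / len(direction)

ranking = sorted(names, key=composite.get, reverse=True)
print(ranking)
```

Equal weights are the naive choice. Principal components analysis would instead derive the weights from how the metrics vary together.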

My most important tip - use graphical techniques, particularly scatterplots showing two variables at a time with a point representing each partner, ideally with the more interesting points neatly labelled for you (the number one feature that Excel lacks, unfortunately). A "scatterplot matrix" is a handy technique if you have the software to do it easily.
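A labelled scatterplot of this kind might look like the following sketch. It assumes matplotlib is installed; the partner names and values are invented, and the log scale on the volume-like axis is a suggestion for skewed data rather than something the answer prescribes:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical data: one point per partner.
names = ["partner_a", "partner_b", "partner_c", "partner_d", "partner_e"]
pages_per_visit = [3.2, 1.8, 5.1, 2.5, 4.0]
repeat_visits = [120, 40, 3000, 90, 15]

fig, ax = plt.subplots()
ax.scatter(pages_per_visit, repeat_visits)

# Label every point with the partner name, offset slightly from the marker.
for name, x, y in zip(names, pages_per_visit, repeat_visits):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(5, 5))

ax.set_yscale("log")  # skewed counts are easier to read on a log axis
ax.set_xlabel("Pages / Visit")
ax.set_ylabel("Repeat Visits (log scale)")
```

Calling `fig.savefig(...)` with a file name of your choice writes the figure to disk. For a scatterplot matrix, `pandas.plotting.scatter_matrix` is one readily available option if the metrics are held in a DataFrame.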

Related Solutions

Solved – Fit power law for distributions with zeroes

I guess your "this is a very newbie question" refers to this one of your many questions:

"...but conceptually, would the point of a power law be violated if there are some zeroes in the data?"

No. The concept remains valid, as the same class of distributions may be applied to data with or without zeros. You may be interested in reading more about the Tweedie class of distributions here and then here.

For example, the well-known Taylor’s law says that the variance is proportional to a power of the mean. Taylor’s law is mathematically identical to the variance-to-mean power law that characterizes the Tweedie distributions; that is, for any random variable that obeys a Tweedie distribution, the variance relates to the mean by a power law. Since that "any random variable" can be discrete, continuous or a combination of both, the concept of the power law may equally apply to data that are counts (Poisson), reals (Normal), positive reals (Gamma), or non-negative reals with added positive mass at zero (compound Poisson–gamma).
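Written out, the variance-to-mean relationship is as follows. The symbols are notation chosen here for illustration: $\mu$ for the mean, $\phi$ for a positive dispersion constant, and $p$ for the power:

```latex
\operatorname{Var}(Y) = \phi\,\mu^{p}, \qquad \mu = \operatorname{E}[Y], \quad \phi > 0

% Special cases of the power p:
%   p = 0      Normal
%   p = 1      Poisson
%   1 < p < 2  compound Poisson--gamma (positive mass at zero)
%   p = 2      Gamma

% On a log-log scale the relationship is a straight line with slope p:
\log \operatorname{Var}(Y) = \log \phi + p \log \mu
```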

Given your "there are some zeros in the data" and your comment "yes, my values are counts", a simple Poisson model may work. If not, e.g. because the zeros are too few or too many, you may try the Neyman Type A distribution (this R package manual mentions it in the context of the Tweedie class of distributions).
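One quick way to judge whether the zeros are "too few or too many" for a simple Poisson is to compare the observed share of zeros with the share a Poisson with the same mean would predict, which is exp(-mean). The sketch below uses made-up counts and only the standard library. It is an informal check, not a formal test:

```python
import math
import statistics

# Hypothetical count data with many zeros.
counts = [0, 0, 0, 0, 0, 0, 1, 0, 0, 7, 0, 12, 0, 0, 3, 0, 0, 9, 0, 0]

lam = statistics.mean(counts)                    # Poisson rate estimated by the sample mean
observed_zero_fraction = counts.count(0) / len(counts)
expected_zero_fraction = math.exp(-lam)          # P(Y = 0) under Poisson(lam)

# For a Poisson the variance equals the mean, so a ratio well above 1
# suggests overdispersion.
dispersion_ratio = statistics.pvariance(counts) / lam

print(f"mean = {lam:.2f}")
print(f"observed zeros = {observed_zero_fraction:.2f}, Poisson expects {expected_zero_fraction:.2f}")
print(f"variance / mean = {dispersion_ratio:.2f}")
```

In this invented sample the zeros are far more frequent than a Poisson with the same mean would produce, which is the situation where an alternative such as the compound distributions mentioned above becomes worth trying.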

I hope some of the above helps.

Solved – Using log-log graph to find equation of power law relationship

I did the following calculation and got a different y value for x = 0.25.
The intercept and slope are similar to those in the OP's question.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

x = (0.25, 1, 2, 2.75, 4.25, 6.5, 8, 13.25, 16.25, 19.25, 19.75, 26.5, 31, 37.75)
y = (4.485605e-08, 1.430240e-08, 7.638950e-09, 6.776308e-09, 3.269885e-09,
     2.609455e-09, 4.378785e-09, 2.260540e-09, 2.039074e-09, 7.119317e-10,
     2.252598e-09, 1.617082e-09, 7.511261e-09, 1.519275e-09)
my_df = pd.DataFrame({'x': x, 'y': y})
# Take natural logs: a power law y = a * x**b is a straight line in log-log space
my_df['log_x'] = np.log(my_df['x'])
my_df['log_y'] = np.log(my_df['y'])
model = LinearRegression()
model.fit(my_df['log_x'].values.reshape(-1, 1), my_df['log_y'].values)
print(model.coef_)       # slope b; reported output: [-0.60989947]
print(model.intercept_)  # log(a); reported output: -18.17749636096132
X = 0.25
log_y = model.intercept_ + model.coef_[0] * np.log(X)
y_pred = np.exp(log_y)   # back-transform from log space to the original scale
print(y_pred)            # reported output: 2.970364e-08
