Choose the normalization method for a co-occurrence matrix

Tags: categorical data, euclidean, matrix, normalization, z-score

I have a co-occurrence matrix of hashtag usage (the value in each cell is the number of times two hashtags appear together in a single tweet); it is transformed from a 2-mode matrix. Now I want to use Ucinet to normalize this matrix because I have 4 matrices like this from different periods. Indeed, I want to compare hashtag usage at different times, so I should normalize all 4 matrices to reduce dimensional effects.

But I find that Ucinet offers different methods for this: z-scores, marginal, Euclidean, and so on. Which is best? I don't know how to choose, and I also find that the outcomes differ between methods. Thanks for your help!

Best Answer

Since the co-occurrence table is a square matrix that is symmetric about the main diagonal, it doesn't make any difference whether you read it by rows or by columns. The main diagonal seems to be zero everywhere, which makes sense, since it's unlikely that one repeats the same hashtag within the same tweet.

My suggestion is to normalize row-wise (or column-wise), that is, to divide each row by its total. In this way, the $i$th row of the table would represent the relative frequency distribution of co-occurrences for the hashtag $i$. You can now compare this frequency distribution with other frequency distributions from other tables or for other tags.
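As a minimal sketch of this row-wise normalization in NumPy (the matrix values below are made up for illustration; any symmetric co-occurrence matrix with a zero diagonal would do):

```python
import numpy as np

# Hypothetical 4x4 hashtag co-occurrence matrix (symmetric, zero diagonal).
C = np.array([
    [0, 5, 2, 1],
    [5, 0, 3, 0],
    [2, 3, 0, 4],
    [1, 0, 4, 0],
], dtype=float)

# Divide each row by its total, so that row i becomes the relative
# frequency distribution of co-occurrences for hashtag i.
# `where=` guards against division by zero for hashtags with no co-occurrences.
row_sums = C.sum(axis=1, keepdims=True)
P = np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)

print(P[0])           # relative frequency distribution for the first hashtag
print(P.sum(axis=1))  # each non-empty row now sums to 1
```

Matrices normalized this way (one per period) can then be compared row by row, since each row is on the same relative-frequency scale regardless of overall tweet volume.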

In my opinion, it doesn't make sense to compute quantiles or moment-based summaries on these distributions (so the z-score is not meaningful either), since the variable in question is qualitative, i.e. having modalities "#1 vs #1", ..., "#1 vs #n".

You can instead use the mode (i.e. the co-occurrence with the highest relative frequency) as a measure of location. As a measure of "variability", you may use the Shannon entropy.

If we denote by $p_{j|i}$ the relative frequency of the co-occurrence "#i vs #j", for $j = 1,\ldots,k$, the Shannon entropy is

$$ H = -\sum_{j=1}^k p_{j|i}\log p_{j|i}, $$

with the convention $p_{j|i}\log p_{j|i} = 0$ if $p_{j|i}=0$. $H$ attains its minimum value of zero when the distribution is degenerate, i.e. all the mass is concentrated on a single co-occurrence. Furthermore, it can be shown that $H\leq \log k$, with equality when the distribution is uniform; thus if you use $H$ to compare frequency distributions with different $k$'s, it's better to use its normalized version $$ H_n = \frac{H}{\log k}. $$
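A small sketch of the entropy computation, assuming each input is one normalized row (a relative frequency distribution) from the table above; the function name and example distributions are my own:

```python
import numpy as np

def shannon_entropy(p, normalize=False):
    """Shannon entropy H = -sum_j p_j * log(p_j), with 0 * log(0) taken as 0.

    If normalize=True, divide by log(k) so the result lies in [0, 1],
    which makes distributions with different numbers of categories comparable.
    """
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                 # drop zeros: 0 * log(0) contributes nothing
    h = -np.sum(nz * np.log(nz))
    if normalize:
        h /= np.log(len(p))
    return h

# Degenerate distribution: all mass on one category, minimum entropy H = 0.
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))
# Uniform distribution: maximum entropy H = log(4), so normalized H_n = 1.
print(shannon_entropy([0.25] * 4, normalize=True))
```

Applied to each row of a normalized co-occurrence matrix, this gives a per-hashtag "spread" score that can be compared across the four periods.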
