Choose the normalization method for a co-occurrence matrix

Tags: categorical data, euclidean, matrix, normalization, z-score

I have a co-occurrence matrix of hashtag usage (the value in each cell is the number of times two hashtags appear together in a single tweet); it is transformed from a 2-mode matrix. Now I want to use Ucinet to normalize this matrix because I have 4 matrices like this from different periods. Indeed, I want to compare hashtag usage at different times, so I should normalize all 4 matrices to reduce dimensional effects.

But I find that Ucinet offers different methods for this: z-scores, marginal, Euclidean, and so on. Which is best? I don't know how to choose, and I also find that the outcomes differ between methods. Thanks for your help!

Best Answer

Since the co-occurrence table is a square matrix that is symmetric about the main diagonal, it doesn't make any difference whether you read it by rows or by columns. The main diagonal seems to be zero everywhere, which makes sense, since it's unlikely that one repeats the same hashtag within the same tweet.

My suggestion is to normalize row-wise (or column-wise), that is, to divide each row by its total. In this way, the $i$th row of the table would represent the relative frequency distribution of co-occurrences for the hashtag $i$. You can now compare this frequency distribution with other frequency distributions from other tables or for other tags.
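As a minimal sketch of this row-wise normalization in NumPy (the matrix values below are made up for illustration; any symmetric co-occurrence matrix with a zero diagonal would do):

```python
import numpy as np

# Hypothetical 4x4 hashtag co-occurrence matrix (symmetric, zero diagonal).
C = np.array([
    [0, 5, 2, 1],
    [5, 0, 3, 0],
    [2, 3, 0, 4],
    [1, 0, 4, 0],
], dtype=float)

# Divide each row by its total, so that row i becomes the relative
# frequency distribution of co-occurrences for hashtag i.
# `where=` guards against division by zero for hashtags with no co-occurrences.
row_sums = C.sum(axis=1, keepdims=True)
P = np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)

print(P[0])           # relative frequency distribution for the first hashtag
print(P.sum(axis=1))  # each non-empty row now sums to 1
```

Matrices normalized this way (one per period) can then be compared row by row, since each row is on the same relative-frequency scale regardless of overall tweet volume.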

In my opinion, it doesn't make sense to compute quantiles or moment-based summaries on these distributions (so the z-score is not meaningful either), since the variable in question is qualitative, i.e. having modalities "#1 vs #1", ..., "#1 vs #n".

You can instead use the mode (i.e. the co-occurrence with the highest relative frequency) as a measure of location. As a measure of "variability", you may use the Shannon entropy.

If we denote by $p_{j|i}$ the relative frequency of the co-occurrence "#i vs #j", for $j = 1,\ldots,k$, the Shannon entropy is

$$ H = -\sum_{j=1}^k p_{j|i}\log p_{j|i}, $$

with the convention $p_{j|i}\log p_{j|i} = 0$ if $p_{j|i}=0$. $H$ attains its minimum value of zero when the distribution is degenerate, i.e. all the mass is concentrated on a single co-occurrence. Furthermore, it can be shown that $H\leq \log k$, with equality when the distribution is uniform; thus if you use $H$ to compare frequency distributions with different $k$'s, it's better to use its normalized version $$ H_n = \frac{H}{\log k}. $$
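A small sketch of the entropy computation, assuming each input is one normalized row (a relative frequency distribution) from the table above; the function name and example distributions are my own:

```python
import numpy as np

def shannon_entropy(p, normalize=False):
    """Shannon entropy H = -sum_j p_j * log(p_j), with 0 * log(0) taken as 0.

    If normalize=True, divide by log(k) so the result lies in [0, 1],
    which makes distributions with different numbers of categories comparable.
    """
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                 # drop zeros: 0 * log(0) contributes nothing
    h = -np.sum(nz * np.log(nz))
    if normalize:
        h /= np.log(len(p))
    return h

# Degenerate distribution: all mass on one category, minimum entropy H = 0.
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))
# Uniform distribution: maximum entropy H = log(4), so normalized H_n = 1.
print(shannon_entropy([0.25] * 4, normalize=True))
```

Applied to each row of a normalized co-occurrence matrix, this gives a per-hashtag "spread" score that can be compared across the four periods.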
