Normalization – Normalize Before or After Merging Datasets in Sentiment Analysis

normalization, sentiment analysis

I'm working on a project to see whether the sentiments of a book series match those of the movie adaptations. I perform sentiment analysis on both and compare the results. The books contain far more words than the movie scripts, so I apply min-max normalization to scale the output. However, I can't decide when to apply the normalization, because the resulting plots differ greatly. Do I normalize before or after merging the sentiment scores into one dataset? Does this choice affect the validity of my results?

If I normalize after merging the data from the books and the movies, it doesn't appear normalized at all. As far as I can tell, this isn't on the same scale:
[Figure: sentiment scores normalized after merging]

If I normalize the data separately before merging, I get better results. I do wonder whether this is the correct way, however. After all, I want solid evidence of whether the movie adaptations match the themes of the books.
[Figure: sentiment scores normalized separately before merging]

Best Answer

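You should scale each dataset separately before merging them for comparison; otherwise you lose the benefit of scaling. If you normalize after merging, the minimum and maximum are taken over the combined data, so the much larger book scores set the range and the movie scores are compressed toward the bottom of it. Scaling beforehand helps maintain the validity of your results, as many forms of analysis are sensitive to relative scales, especially when comparing groups.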

Intuitively, you were right to scale to account for the difference in total word count between the movies and the books, as this lets you compare the proportion of words expressing each sentiment rather than raw counts.
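To make the difference concrete, here is a minimal sketch in plain Python. The scores are made-up numbers, not the asker's data, and `min_max`, `book_scores`, and `movie_scores` are hypothetical names chosen for illustration. It assumes the raw sentiment scores grow with text length, so the book values are roughly ten times the movie values.

```python
def min_max(values):
    """Scale a list of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw sentiment scores (illustrative values only).
book_scores = [120, 340, 560, 230]
movie_scores = [12, 30, 55, 20]

# Option A: normalize each source on its own, then put them side by side.
separate_book = min_max(book_scores)
separate_movie = min_max(movie_scores)

# Option B: merge first, then normalize over the combined range.
merged = min_max(book_scores + movie_scores)
merged_book = merged[:len(book_scores)]
merged_movie = merged[len(book_scores):]

print("separate:", separate_book, separate_movie)
print("merged:  ", merged_book, merged_movie)
```

With Option A both series span the full 0 to 1 range, so their shapes can be compared directly. With Option B the movie values all land near the bottom of the range, because the combined maximum comes from the books. That matches the "not on the same scale" plot described in the question.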
