PCA does require normalization as a pre-processing step.
Normalization is important in PCA since it is a variance-maximizing exercise: it projects your original data onto the directions that maximize the variance. Source: here
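For example, one common way to do this with scikit-learn is to standardize the features before fitting PCA. This is only a sketch of the idea, not code from the question; the StandardScaler step and the toy data are my own choices:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Toy data with two features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0.0, 1.0, 500), rng.normal(20.0, 30.0, 500)])

# Standardize each feature to zero mean and unit variance, then fit PCA.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)
print(pca.explained_variance_ratio_)

Without the standardization step, the second feature's much larger variance would dominate the first principal component.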
Would a further step of data normalization harm the data?
No, it would not harm the data. But would it really be necessary?
import numpy as np
from sklearn.decomposition import PCA

# Two correlated features on very different scales.
mean = [0.0, 20.0]
cov = [[1.0, 0.7], [0.7, 1000]]
values = np.random.multivariate_normal(mean, cov, 1000)

# Keep one component and whiten, so the scores have unit variance.
pca = PCA(n_components=1, whiten=True)
pca.fit(values)
values_ = pca.transform(values)

print(np.var(values_))
This exercise returns a variance of 1.0.
Why? We are projecting two whitened features onto the first component.
Let's assume that a point in the whitened space is identified by a vector $a$.
The new vector $a'$ is the result of the transformation
$$a' = |a| \cos(\theta) = a \cdot \hat{b}$$
where $|a|$ is the length of $a$ and $\theta$ is the angle between $a$ and the unit vector $\hat{b}$ we are projecting onto. In this case $\hat{b}$ equals $e$, the eigenvector that maps each row vector onto the first principal component.
What is the variance of the whitened feature once projected onto the principal component?
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (a_i \cdot e)^2 = e^T \frac{A^T A}{n} e$$
where $A$ is the whitened data matrix whose $i$-th row is $a_i$. $e^Te = 1$ by definition (eigenvectors are unit vectors). Note that when we whitened the data, we imposed zero means and an identity covariance on the feature set, so $\frac{A^T A}{n} = I$ and therefore $\sigma^2 = e^T I e = e^T e = 1$.
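As a quick numerical check of this (my own sketch, whitening by hand via an eigendecomposition rather than through sklearn's PCA): once the data are whitened, the sample covariance is approximately the identity, so the variance of the projection onto any unit vector, not just the eigenvector, is approximately 1.

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 20.0], [[1.0, 0.7], [0.7, 1000]], 100000)

# Whiten by hand: center, decorrelate, and rescale each direction to unit variance.
Xc = X - X.mean(axis=0)
eigval, eigvec = np.linalg.eigh(Xc.T @ Xc / len(Xc))
A = Xc @ eigvec / np.sqrt(eigval)   # whitened data: A.T @ A / n is ~ the identity

# Project onto any unit vector e: the variance e.T @ (A.T @ A / n) @ e stays ~ 1.
e = np.array([0.6, 0.8])            # an arbitrary unit vector
print(np.var(A @ e))                # ~ 1.0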
In my opinion, since you are using kNN imputation, and kNN is based on distances, you should normalize your data prior to kNN imputation. The problem is that the normalization will be affected by the NA values, which should be ignored when computing the scaling statistics.
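A minimal sketch of that idea (the array here is made up, and scikit-learn's KNNImputer stands in for whatever kNN imputation routine you are actually using):

import numpy as np
from sklearn.impute import KNNImputer

# Made-up data with missing entries encoded as np.nan.
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [np.nan, 180.0],
              [4.0, 220.0]])

# Standardize each column using statistics that ignore the NAs.
mu = np.nanmean(X, axis=0)
sigma = np.nanstd(X, axis=0)
X_scaled = (X - mu) / sigma

# Impute on the scaled data, then map back to the original units if needed.
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X_scaled)
X_back = X_imputed * sigma + mu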
For instance, take the E. coli dataset, in which the variables' magnitudes are quite homogeneous. Creating NAs artificially for percentages from 0.05 to 0.20 in steps of 0.01 produces the following mean squared errors (MSE) between the original and the imputed dataset:
0.08380378;
0.08594711;
0.09165323;
0.1005489;
0.09978495;
0.1120758;
0.1046071;
0.1048477;
0.1087384;
0.1283818;
0.1201014;
0.1264724;
0.1337024;
0.1457246;
0.1365055;
0.154879;
By contrast, if you take the Breast Tissue dataset, which has heterogeneous magnitudes across its variables, you get:
889.4696;
927.6151;
773.7256;
1229.74;
3356.833;
645.8142;
755.98;
2110.523;
987.5008;
1796.339;
1603.461;
1476.863;
2887.509;
2001.222;
905.6305;
2334.935;
That is, with normalization you can keep the MSE within a reasonable range.
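For reference, here is a sketch of how such a comparison could be run; the use of scikit-learn's KNNImputer, the random NA mask, and the stand-in data are my assumptions and may differ from the tooling behind the numbers above:

import numpy as np
from sklearn.impute import KNNImputer

def imputation_mse(X, frac, n_neighbors=5, seed=0):
    # Hide a fraction of the entries at random, impute with kNN, return the MSE.
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < frac
    X_missing = np.where(mask, np.nan, X)
    X_imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(X_missing)
    return np.mean((X_imputed[mask] - X[mask]) ** 2)

# Stand-in data; replace with the actual dataset (e.g. the E. coli features),
# once raw and once scaled, to reproduce the two kinds of MSE behaviour above.
X = np.random.default_rng(1).normal(size=(200, 7))
for frac in np.arange(0.05, 0.21, 0.01):
    print(round(frac, 2), imputation_mse(X, frac))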
You should definitely scale each dataset prior to merging them for comparison; otherwise you lose the benefit of scaling. Scaling beforehand helps maintain the validity of your results, as many forms of analysis are sensitive to relative scales, especially when comparing groups.
Intuitively, you were right to scale to account for the difference in total word count between the movies and the books, as this allows you to compare the proportion of words conveying each sentiment rather than raw counts.
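For instance, here is a sketch of that kind of scaling with pandas; the data frames and column names are hypothetical, just to illustrate dividing by each source's total word count before merging:

import pandas as pd

# Hypothetical sentiment word counts per title, for movies and for books.
movies = pd.DataFrame({"title": ["m1", "m2"], "positive": [120, 40],
                       "negative": [80, 10], "total_words": [20000, 5000]})
books = pd.DataFrame({"title": ["b1", "b2"], "positive": [900, 300],
                      "negative": [600, 450], "total_words": [90000, 60000]})

# Scale each dataset by its own total word counts before merging,
# so the comparison is between proportions rather than raw counts.
for df in (movies, books):
    df["positive_rate"] = df["positive"] / df["total_words"]
    df["negative_rate"] = df["negative"] / df["total_words"]

combined = pd.concat([movies.assign(source="movie"),
                      books.assign(source="book")], ignore_index=True)
print(combined[["source", "title", "positive_rate", "negative_rate"]])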