Solved – Efficiently normalize word embeddings

natural language, normalization, word embeddings, word2vec

I'm using GloVe word embeddings and would like to normalize them to [-1, 1] using Python. The data is stored as a dict with the word as the key and a NumPy array as the value. As far as I can tell, I would have to loop through all 2M entries to find the min and max, and then loop through them again to normalize.
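For reference, the straightforward two-pass version I have in mind looks roughly like this (a sketch, assuming the dict is named d):

import numpy as np

# First pass: find the global min and max across all vectors
lo, hi = np.inf, -np.inf
for vec in d.values():
    lo = min(lo, vec.min())
    hi = max(hi, vec.max())

# Second pass: rescale every vector to [-1, 1]
for word, vec in d.items():
    d[word] = 2 * (vec - lo) / (hi - lo) - 1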

Is there a more efficient way to do that?

Thanks in advance!

Best Answer

A loop is your only option here unless you have saved your word embeddings in some other format, such as a binary file. A list comprehension (or a direct conversion of the dict values) should be fairly quick even with 2M entries. Assuming your dictionary is named d, you could do the following:

import numpy as np

# Stack the dict's vectors into a single 2-D array (one row per word)
emb = np.array(list(d.values()))

Once you have converted the dictionary values into a NumPy array, you can normalize the data with a convenient tool from scikit-learn such as minmax_scale:

from sklearn.preprocessing import minmax_scale

# Scales each embedding dimension (column) independently to [-1, 1]
emb_scaled = minmax_scale(emb, feature_range=(-1, 1))
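If you need the result as a word-to-vector dict again, you can zip the original keys back onto the rows of the scaled array. A minimal sketch, assuming d has not been modified since emb was built:

# Rebuild the word -> scaled-vector mapping (dict order is preserved in Python 3.7+)
d_scaled = dict(zip(d.keys(), emb_scaled))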