Solved – explain the meaning and purpose of L2 normalization

machine learning

Let me say at the outset that I am very new to machine learning, and not great at math. I understand what TF-IDF does, but the book I am reading also notes the following (it's discussing how scikit-learn does things):

Both classes [TfidfTransformer and TfidfVectorizer] also apply L2 normalization after computing the tf-idf representation; in other words, they rescale the representation of each document to have Euclidean norm 1. Rescaling in this way means that the length of a document (the number of words) does not change the vectorized representation.

That's all it has to say about the subject. What I think it means, and let me know if I'm wrong, is that we scale the values so that if they were all squared and summed, the result would be 1 (I took this definition from http://kawahara.ca/how-to-normalize-vectors-to-unit-norm-in-python/).
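To check my understanding, here is a quick sanity check I put together (my own sketch, not from the book; as far as I can tell, norm='l2' is scikit-learn's default, so passing it explicitly is just for emphasis):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]

# norm='l2' is the default; spelled out here for emphasis
X = TfidfVectorizer(norm="l2").fit_transform(docs).toarray()

# each document row, squared and summed, should come out to 1
print((X ** 2).sum(axis=1))  # [1. 1.]
```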

So the idea, then, is that the feature values become proportionate to one another. I'm not totally sure how that helps the model, though. Does it help the overall classifier learn if no example has a higher total number of "turned on" features than the others?

Also, here's a basic question: Does L2 normalization have anything to do with L2 regularization? Maybe it's just that both of them involve squaring and summing terms?

Whatever insight you can share, would be most appreciated!

Best Answer

we scale the values so that if they were all squared and summed, the value would be 1

That's correct.

I'm not totally sure how that would be helpful for the model, though

Consider a simpler case, where we just count the number of times each word appears in each document. In this case, two documents might appear different simply because they have different lengths (the longer document contains more words). But, we're more interested in the meaning of the document, and the length doesn't contribute to this. Normalizing lets us consider the frequency of words relative to each other, while removing the effect of total word count.
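Here is a minimal sketch of that effect, using plain word counts and made-up numbers: a document and the same document repeated twice get different count vectors, but identical L2-normalized vectors.

```python
import numpy as np

# Hypothetical count vectors: doc_b is doc_a repeated twice,
# so every word count doubles but the relative frequencies match.
doc_a = np.array([2.0, 1.0, 0.0, 3.0])
doc_b = 2 * doc_a

# L2-normalize each vector (rescale to Euclidean norm 1)
unit_a = doc_a / np.linalg.norm(doc_a)
unit_b = doc_b / np.linalg.norm(doc_b)

print(np.allclose(unit_a, unit_b))  # True: the length effect is gone
```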

Does L2 normalization have anything to do with L2 regularization?

L2 regularization operates on the parameters of a model, whereas L2 normalization (in the context you're asking about) operates on the representation of the data. They're not related in any meaningful sense, beyond the superficial fact that both require computing L2 norms (summing squared terms, as you say).
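To make the distinction concrete (my notation, not anything from your book):

$$\text{L2 normalization of a data vector } x: \quad x \mapsto \frac{x}{\lVert x \rVert_2}, \qquad \lVert x \rVert_2 = \sqrt{\textstyle\sum_i x_i^2}$$

$$\text{L2 regularization of model weights } w: \quad \min_w \; L(w) + \lambda \lVert w \rVert_2^2$$

The first rescales each input before the model ever sees it; the second penalizes large weights during training. Both involve the same norm, but they act on different objects for different reasons.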

But, note that L2 normalization is a generic operation, and can apply in contexts beyond the one you're asking about. There do exist situations where one could draw a connection between the two concepts, but I think that's beyond the scope of this question.
