Solved – the purpose of row normalization

Tags: distance, k nearest neighbour, normalization, similarities

I understand the reasoning behind column normalization: it causes the features to be weighted equally even when they are not measured on the same scale. However, in the nearest-neighbour literature, both columns and rows are often normalized. What is row normalization for, and why normalize rows? Specifically, how does row normalization affect the similarity/distance between row vectors?

Best Answer

This is a relatively old thread, but I recently encountered this issue in my work and stumbled upon this discussion. The question has been answered, but I feel that the danger of normalizing rows when the row is not the unit of analysis (see @DJohnson's answer above) has not been addressed.

The main point is that normalizing rows can be detrimental to any subsequent analysis, such as nearest-neighbor or k-means. For simplicity, I will keep the answer specific to mean-centering the rows.
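To make the terminology concrete, here is a minimal sketch in base R of the two operations, using sweep() in the same way as the code at the end of the answer: column-mean-centering subtracts each feature's mean from its column, while row-mean-centering subtracts each observation's own mean from its row.

# Toy matrix: rows = observations, columns = features
X <- matrix(c(1, 10,
              2, 20,
              3, 30), nrow=3, byrow=TRUE)

# Column-mean-centering: subtract each feature's (column's) mean
Xc <- sweep(X, 2, colMeans(X), "-")

# Row-mean-centering: subtract each observation's (row's) own mean
Xr <- sweep(X, 1, rowMeans(X), "-")

Xc
Xr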

To illustrate this, I will use simulated Gaussian data at the corners of a hypercube. Luckily, R has a convenient function for that (the code is at the end of the answer). In the 2D case it is straightforward to see that the row-mean-centered data will fall on a line passing through the origin at 135 degrees. The simulated data are then clustered using k-means with the correct number of clusters. The data and the clustering results (visualized in 2D using PCA on the original data) look like this (note that the axes of the leftmost plot are different). The different shapes of the points in the clustering plots indicate the ground-truth cluster assignment, and the colors are the result of the k-means clustering.

[Figure: the original, column-centered and row-centered 2D data (left), and the k-means clustering results for each version, plotted in the PCA coordinates of the original data]

The top-left and bottom-right clusters get cut in half when the data is row-mean-centered. So the distances after row-mean-centering get distorted and are not very meaningful (at least based on our knowledge of the data).
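To see the distortion in isolation, here is a minimal toy sketch (my own illustration, separate from the hypercube simulation): in 2D, row-mean-centering collapses every point onto the line $x_2 = -x_1$, so two points that are far apart in the original space can end up on top of each other.

# In 2D, row-mean-centering maps (x1, x2) to ((x1-x2)/2, (x2-x1)/2),
# so every centered point lies on the line x2 = -x1
set.seed(1)
X  <- matrix(rnorm(200), ncol=2)
Xr <- sweep(X, 1, rowMeans(X), "-")
all(abs(Xr[,1] + Xr[,2]) < 1e-12)      # TRUE: all points collapse onto one line

# Two points that differ only by a constant shift become identical
a <- c(1, 5)
b <- a + 10
dist(rbind(a, b))                      # about 14.1 in the original space
dist(rbind(a - mean(a), b - mean(b)))  # 0 after row-mean-centering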

This may not be so surprising in 2D, but what happens if we use more dimensions? Here is what happens with 3D data. The clustering solution after row-mean-centering is "bad".

[Figure: the data and the k-means clustering results for the 3D case, visualized as above]

And it is similar with 4D data (not shown for brevity).

Why is this happening? Row-mean-centering pushes the data into a space where some features end up closer to each other than they otherwise would be. This should be reflected in the correlations between the features. Let's look at those (first for the original data and then for the row-mean-centered data, in the 2D and 3D cases).

2D, original data:

       [,1]   [,2]
[1,]  1.000 -0.001
[2,] -0.001  1.000

2D, row-mean-centered:

     [,1] [,2]
[1,]    1   -1
[2,]   -1    1

3D, original data:

       [,1]   [,2]  [,3]
[1,]  1.000 -0.001 0.002
[2,] -0.001  1.000 0.003
[3,]  0.002  0.003 1.000

3D, row-mean-centered:

       [,1]   [,2]   [,3]
[1,]  1.000 -0.504 -0.501
[2,] -0.504  1.000 -0.495
[3,] -0.501 -0.495  1.000

So it looks like row-mean-centering introduces correlations between the features. How is this affected by the number of features? We can do a simple simulation to figure that out. The results of the simulation are shown below (again, the code is at the end).

[Figure: maximum absolute between-feature correlation as a function of the number of features, for the original and the row-mean-centered data]

So as the number of features increases, the effect of row-mean-centering seems to diminish, at least in terms of the introduced correlations. Note, however, that we used uniformly distributed random data for this simulation (as is common when studying the curse of dimensionality).
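A quick back-of-the-envelope calculation shows why, at least under the simplifying assumption (which holds for the uniform data above, but not in general) that the $p$ features are i.i.d. with common variance $\sigma^2$. Writing $\bar{x}$ for the mean of the features within a row and $\tilde{x}_j = x_j - \bar{x}$ for the row-mean-centered features,

$$\operatorname{Var}(\tilde{x}_j) = \sigma^2\left(1 - \frac{1}{p}\right), \qquad \operatorname{Cov}(\tilde{x}_j, \tilde{x}_k) = -\frac{\sigma^2}{p} \quad (j \neq k),$$

so the induced correlation is $-\frac{1/p}{1-1/p} = -\frac{1}{p-1}$. For $p=2$ the centered features are exact negatives of each other ($\tilde{x}_2 = -\tilde{x}_1$), giving correlation $-1$; for $p=3$ the value is $-0.5$, matching the matrices above; and as $p$ grows the induced correlation vanishes.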

So what happens when we use real data? Since the intrinsic dimensionality of real data is often much lower, the curse of dimensionality might not apply. In such a case I would guess that row-mean-centering might be a "bad" choice, as shown above. Of course, more rigorous analysis is needed to make any definitive claims.

Code for clustering simulation

palette(rainbow(10))
set.seed(1024)
require(mlbench)

N <- 5000
for (D in 2:4) {
  # Simulate Gaussian clusters at the corners of a D-dimensional hypercube
  X  <- mlbench.hypercube(N, d=D)
  sh <- as.numeric(X$classes)   # ground-truth labels, used as plotting symbols
  K  <- length(unique(sh))      # true number of clusters
  X  <- X$x

  # Column-mean-centered and row-mean-centered versions of the data
  Xc <- sweep(X, 2, apply(X, 2, mean), "-")
  Xr <- sweep(X, 1, apply(X, 1, mean), "-")

  # Feature correlations before and after row-mean-centering
  show(round(cor(X), 3))
  show(round(cor(Xr), 3))

  # k-means with the true number of clusters on each version of the data
  k  <- kmeans(X,  K, iter.max=1000, nstart=10)
  kc <- kmeans(Xc, K, iter.max=1000, nstart=10)
  kr <- kmeans(Xr, K, iter.max=1000, nstart=10)

  # PCA on the original data, used only for 2D visualization
  pc <- prcomp(X)
  par(mfrow=c(1,4))

  # Panel 1: original, column-centered and row-centered data (first two features)
  lim <- c(min(X, Xr, Xc), max(X, Xr, Xc))
  plot(X[,1], X[,2], xlim=lim, ylim=lim, xlab="Feature 1", ylab="Feature 2",
       main="Data", col=1, pch=1)
  points(Xc[,1], Xc[,2], col=2, pch=2)
  points(Xr[,1], Xr[,2], col=3, pch=3)
  legend("topleft", legend=c("Original","Center-cols","Center-rows"),
         col=c(1,2,3), pch=c(1,2,3))
  abline(h=0, v=0, lty=3)

  # Panels 2-4: clustering results in PCA coordinates
  # (symbol = ground truth, color = k-means assignment)
  plot(pc$x[,1], pc$x[,2], col=rainbow(K)[k$cluster],  xlab="PC 1", ylab="PC 2",
       main="Cluster original",   pch=sh)
  plot(pc$x[,1], pc$x[,2], col=rainbow(K)[kc$cluster], xlab="PC 1", ylab="PC 2",
       main="Cluster center-col", pch=sh)
  plot(pc$x[,1], pc$x[,2], col=rainbow(K)[kr$cluster], xlab="PC 1", ylab="PC 2",
       main="Cluster center-row", pch=sh)
}

Code for increasing features simulation

set.seed(2048)
N  <- 1000
Ds <- 2:100                    # number of features to test
Cmax  <- c()                   # max |correlation| in the original data
Crmax <- c()                   # max |correlation| after row-mean-centering
for (D in Ds) {
  X <- matrix(runif(N * D), nrow=N)
  C <- abs(cor(X))
  diag(C) <- NA
  Cmax <- c(Cmax, max(C, na.rm=TRUE))

  Xr <- sweep(X, 1, apply(X, 1, mean), "-")
  Cr <- abs(cor(Xr))
  diag(Cr) <- NA
  Crmax <- c(Crmax, max(Cr, na.rm=TRUE))
}
par(mfrow=c(1,1))
plot(Ds, Cmax, ylim=c(0,1), ylab="Max. cor.", xlab="#Features", col=1, pch=1)
points(Ds, Crmax, col=2, pch=2)
legend("topright", legend=c("Original","Center-row"), pch=1:2, col=1:2)

EDIT

After some googling, I ended up on this page, where the simulations show similar behavior and it is proposed that the correlation introduced by row-mean-centering is $-1/(p-1)$.
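As a rough empirical check of that value, here is a small sketch using the same uniform-data setup as the simulation above, comparing the mean off-diagonal correlation after row-mean-centering with $-1/(p-1)$ for a few values of $p$:

set.seed(4096)
N <- 1000
for (p in c(2, 3, 5, 10, 50)) {
  X  <- matrix(runif(N * p), nrow=N)
  Xr <- sweep(X, 1, apply(X, 1, mean), "-")   # row-mean-centering
  C  <- cor(Xr)
  cat(sprintf("p = %2d   mean off-diag cor = %6.3f   -1/(p-1) = %6.3f\n",
              p, mean(C[upper.tri(C)]), -1/(p - 1)))
}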