Solved – Understanding Singular Value Decomposition in the context of LSI

latent-semantic-indexing, natural-language, r, svd

My question is generally on Singular Value Decomposition (SVD), and particularly on Latent Semantic Indexing (LSI).

Say I have a word-by-document matrix $A_{word \times document}$ that contains the frequencies of 5 words across 7 documents.

# term-document matrix: rows are words, columns are documents
A = matrix(data=c(2,0,8,6,0,3,1,
                  1,6,0,1,7,0,1,
                  5,0,7,4,0,5,6,
                  7,0,8,5,0,8,5,
                  0,10,0,0,7,0,0), ncol=7, byrow=TRUE)
rownames(A) <- c('doctor','car','nurse','hospital','wheel')

I get the matrix factorization for $A$ by using SVD: $A = U \cdot D \cdot V^T $.

s = svd(A)
D = diag(s$d) # singular value matrix
S = diag(s$d^0.5 ) # diag matrix with square roots of singular values.
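
For reference, the three factors reconstruct $A$ exactly (up to floating-point error), which is easy to verify:

# A = U %*% D %*% t(V), so rebuilding A from the factors should give it back.
max(abs(A - s$u %*% D %*% t(s$v)))  # should be on the order of 1e-14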

In [1] and [2], it is stated that:

$WordSim = U \cdot S$ gives the word similarity matrix, where the rows of $WordSim$ represent different words.

WordSim = s$u %*% S

$DocSim = S \cdot V^T$ gives the document similarity matrix, where the columns of $DocSim$ represent different documents.

DocSim = S %*% t(s$v)
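
As a quick sanity check on these definitions (an algebraic identity that follows from the construction above, not a statement from the linked sources): because $S \cdot S = D$, the two matrices multiply back to the original term-document matrix, $WordSim \cdot DocSim = U \cdot D \cdot V^T = A$.

# WordSim %*% DocSim = U %*% S %*% S %*% t(V) = U %*% D %*% t(V) = A
max(abs(A - WordSim %*% DocSim))  # should be ~0, up to floating-point error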

Questions:

  1. Algebraically, why are $WordSim$ and $DocSim$ word/document similarity matrices? Is there an intuitive explanation?
  2. Based on the R example given, can we make any intuitive word count / similarity observations by just looking at $WordSim$ and $DocSim$ (without using cosine similarity or correlation coefficient between rows / columns)?


Best Answer

Matrix factorization using SVD decomposes the input matrix into three parts:

  • The left singular vectors $U$. The first column of this matrix specifies the axis along which the rows of the input matrix vary the most. In your case, the first column tells you which words vary together the most.
  • The singular values $D$. These are scalings, and they are relative to each other. If the first value of $D$ is twice as big as the second, it means that the first singular vector (in $U$ and $V^T$) explains twice as much variation as the second singular vector.
  • The right singular vectors $V^T$. The first row of this matrix specifies the axis along which the columns of the input matrix vary the most. In your case, the first row tells you which documents vary together the most (see the sketch after this list).
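
A minimal sketch of how to inspect these three parts for your example data, using the s = svd(A) object from the question (keep in mind that the sign of a singular vector is arbitrary, so only the relative pattern of the values is meaningful):

# Relative size of each singular value: the scalings, expressed relative to the first.
round(s$d / s$d[1], 2)

# First left singular vector: how strongly each of the five words loads on the dominant axis.
round(setNames(s$u[, 1], rownames(A)), 2)

# First right singular vector: how strongly each of the seven documents loads on that axis.
round(s$v[, 1], 2)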

When words or documents vary together, it indicates that they are similar. For example, if the word doctor occurs more often in a document, the words nurse and hospital also tend to occur more often. This is shown by the first scaled left singular vector, i.e. the first column of $WordSim$. You can validate this result by looking at the input data: notice that when nurse occurs, hospital also occurs, and when nurse does not occur, hospital does not occur either.
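
To see this grouping numerically, here is a small sketch using the $WordSim$ matrix computed in the question (again, the overall sign of a column is arbitrary, so compare relative patterns only):

# Loadings of the five words on the first two latent dimensions. Words that
# co-occur across documents (doctor, nurse, hospital) should receive loadings
# of similar sign and magnitude on the first dimension, while car and wheel
# should sit closer to zero there and stand out on the second dimension instead.
round(setNames(WordSim[, 1], rownames(A)), 2)
round(setNames(WordSim[, 2], rownames(A)), 2)

This also gives a rough answer to question 2: words whose rows in $WordSim$ look alike (and documents whose columns in $DocSim$ look alike) are the ones LSI treats as related, even before computing any cosine similarity or correlation.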
