Solved – How to reduce dimension for text document dataset

datasetdimensionality reductiontext mining

I have a basic doubt in dimension reduction for text dataset eg. 20Newsgroup, rcv1 etc. Initially I extract the number of word occurrence in each document, i.e word x document matrix would be $n \times d$ where $n$ is the number of documents and $d$ is the dimension.

I would like to reduce the dimension, say $d_1 << d$. What is the standard technique of reducing the dimension?

  1. Choose the top $d_1$ feature from the original word occurrence matrix$( n \times d)$ and then calculate TF-IDF for the reduced matrix $(n \times d_1)$, or
  2. Calculate the TF-IDF matrix for $n \times d$ matrix and then select top $d_1$ features.

Also, it is mentioned in many literature that top feature are selected. I wanted to know what is selecting the top features means? How do they define it?

Best Answer

I will use established notations, when initial TF-IDF matrix stores documents at columns, and rows correspond to term occurrences or term's tf/idfs. Let $A$ be $(n \times d)$ matrix with $d$ domuments and $n$ dimensions.

The standart method to do feature reduction in text mining is latent semantic indexing. The key idea is applying a little modification of SVD decomposition for $n \times d$ TF-IDF matrix (or just word occurrence matrix).

Particularly, let our initial matrix $A$ be decomposed: $$A = S\times D \times T^t,$$ where $S, D, T^t$ have dimensions $(n \times r), (r \times r), (r \times d)$ respectively and $D$ is a diagonal matrix with $A$'th sorted singular values on diagonal: $D = diag(d_1,d_2,\dots,d_r), d_1 \le \dots d_r$.

The modification performed in latent semantid indexing is truncating the matrix $D$ so that only $k \le r$ largest singular values remained. It can be shown, that $A \approx A_k = S_k \times D_k \times T_k^t$, where $S_k$ is $n \times k$ matrix of first $k$ columns of $S$, $T_k^t$ is $k \times d$ matrix of first $k$ rows of $T^t$, and $D_k = diag(r_1,\dots,r_k)$. The matrix $T_k$ is a concept document matrix, which rows store reduced description of document. Further you apply text-mining algorithms to that matrix as you could apply them to initial TF-IDF matrix.