Solved – Understanding Feature Hashing

feature-engineering

Wikipedia provides the following example when describing feature hashing; but the mapping does not seem consistent with the dictionary defined

For example, to should be converted to 3 according to the dictionary, but it is encoded as 1 instead.

Is there an error in the description? How does feature hashing work?

The texts:

John likes to watch movies. Mary likes too.
John also likes to watch football games.

can be converted, using the dictionary

{"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, 
"football": 7, "games": 8, "Mary": 9, "too": 10}

to the matrix

[[1 2 1 1 1 0 0 0 1 1]
 [1 1 1 1 0 1 1 1 0 0]]

Best Answer

The matrix is constructed in the following way:

  • rows represent lines
  • columns represent features

and every entry matrix(i,j)=k means:

In line i, the word with index j appears k times.

So to is mapped to index 3. It appears exactly one time in line 1. So m(1,3)=1.

More examples

  • likes is mapped to index 2. It appears exactly two times in the first line. So m(1,2)=2
  • also is mapped to index 6. It does not appear in line 1, but one time in line 2. So m(1,6)=0 and m(2,6)=1.
Related Question