Text Mining – Measuring Semantic Distance Between Phrases Using Closest Distance in Hypernym Tree

distance-functions, text-mining

Per my earlier question, I'm trying to find a reasonable metric for the semantic distance between two short text strings. One metric mentioned in the answers to that question was to use the shortest hypernym path to create a metric for phrases. So for instance, if I wanted to find the semantic distance between pig and dog, I could ask WordNet for all of their hypernyms:

pig=> swine=> even-toed ungulate=> hoofed mammal=> placental mammal=> mammal=> vertebrate=> chordate=> animal=> organism=> living thing=> object=> physical entity=> entity

dog=> canine=> carnivore=> placental mammal=> mammal=> vertebrate=> chordate=> animal=> organism=> living thing=> object=> physical entity=> entity

and I would find that the shortest path between pig and dog runs through their lowest common hypernym, placental mammal, giving 4 + 3 = 7 jumps – so semantic distance = 7.
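Counting edges through the lowest common hypernym can be sketched as follows (the chains are transcribed from above; in a real system they would come from a WordNet library such as NLTK rather than hard-coded lists):

```python
def hypernym_distance(chain_a, chain_b):
    """Edges on the shortest path joining two words whose hypernym
    chains both ascend to a common root: for each shared hypernym,
    add the depths at which it appears in each chain, and take the min."""
    positions_b = {node: j for j, node in enumerate(chain_b)}
    best = None
    for i, node in enumerate(chain_a):
        if node in positions_b:
            hops = i + positions_b[node]
            if best is None or hops < best:
                best = hops
    return best

pig = ["pig", "swine", "even-toed ungulate", "hoofed mammal",
       "placental mammal", "mammal", "vertebrate", "chordate", "animal",
       "organism", "living thing", "object", "physical entity", "entity"]
dog = ["dog", "canine", "carnivore", "placental mammal", "mammal",
       "vertebrate", "chordate", "animal", "organism", "living thing",
       "object", "physical entity", "entity"]

print(hypernym_distance(pig, dog))  # 7: the chains meet at "placental mammal"
```

Note that this counts edges, not nodes: pig is 4 edges below placental mammal and dog is 3 edges below it.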

If I wanted to extend this concept to entire phrases, then perhaps I could (naively) find the average distance between all word pairs in the phrases. (Obviously, one should be able to find something much better than this.)

My question: I'm sure someone has thought of this before. Where should I look in literature to find more information. And what are the hidden gotchas when using such an approach.

Best Answer

Graph Distances

For words, I'd guess you're looking for a distance measure defined over a tree/graph, e.g. geodesic distance and related constructions, that extends to phrases. I wonder if you could treat the phrases as subtrees/subgraphs within the larger graph structure and use a tree similarity measure, e.g. tree edit distance, to get a similarity measure?
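The geodesic (fewest-edges) distance mentioned here is just breadth-first search over the taxonomy viewed as an undirected graph; a minimal sketch, using a hand-built fragment of the taxonomy above as an illustrative stand-in for the full WordNet graph:

```python
from collections import deque

def geodesic(graph, src, dst):
    """Breadth-first search: fewest edges between two nodes of an
    unweighted, undirected graph given as an adjacency dict."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # nodes are disconnected

# Tiny illustrative fragment of the hypernym graph above:
g = {
    "pig": ["swine"], "swine": ["pig", "even-toed ungulate"],
    "even-toed ungulate": ["swine", "hoofed mammal"],
    "hoofed mammal": ["even-toed ungulate", "placental mammal"],
    "placental mammal": ["hoofed mammal", "carnivore"],
    "carnivore": ["placental mammal", "canine"],
    "canine": ["carnivore", "dog"], "dog": ["canine"],
}
print(geodesic(g, "pig", "dog"))  # 7
```

Tree edit distance for the phrase-as-subtree idea is considerably more involved (e.g. the Zhang–Shasha algorithm) and is not sketched here.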

Distributional Distances

But for a return to the geometrical (rather than graph-structural) approach to answering your previous tweet-related question, I suggest googling 'semantic space'. It was big in the late '90s and seems to be having a big-data revival. You'll get a lot of Latent Semantic Analysis / Indexing results, but the intuitions about semantic distance in a semantic space are the same whether it's about words or phrases and whether or not there is a dimensionality reduction first. These intuitions are basically:

  1. meaning is distributional similarity, a.k.a. substitutability in context.
  2. semantic similarity is usually the (cosine of the) angle between two word vectors.
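The second intuition can be sketched in a few lines; the co-occurrence counts below are invented for illustration and the real vectors would come from a corpus:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors:
    dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up co-occurrence counts over three context words:
pig_vec = [3.0, 1.0, 0.0]
dog_vec = [2.0, 2.0, 1.0]
print(round(cosine(pig_vec, dog_vec), 3))  # 0.843
```

Cosine ranges from 1 (identical direction, i.e. identical distribution up to scale) down to 0 for orthogonal count vectors, so it is a similarity rather than a distance; 1 − cosine is a common conversion.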

A more or less random collection of links includes e.g. this intro and this thesis, and work by H. Schütze, T. Pedersen, S. Finch and N. Chater, and S. Evert, but there is a lot more out there.

In this general framework, phrases can be turned into single vectors by adding (or otherwise combining) their word vectors before angles are computed.
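Additive composition is the simplest such combination; a minimal sketch, assuming hypothetical 3-dimensional word vectors:

```python
def phrase_vector(words, word_vectors):
    """Compose a phrase vector by summing the vectors of its words."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for w in words:
        for i, x in enumerate(word_vectors[w]):
            total[i] += x
    return total

# Hypothetical word vectors (invented for illustration):
vecs = {"guard": [1.0, 0.0, 2.0], "dog": [2.0, 2.0, 1.0],
        "farm": [0.0, 3.0, 1.0], "pig": [3.0, 1.0, 0.0]}

print(phrase_vector(["guard", "dog"], vecs))  # [3.0, 2.0, 3.0]
```

The resulting phrase vectors can then be compared with the same cosine measure used for single words. Note that addition ignores word order, so "dog bites man" and "man bites dog" get identical vectors.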

Some software for constructing these semantic spaces is here or more recently here.