Solved – In natural language parsing, what is the feature function

machine-learning, maximum-entropy, natural-language

I'm working in the area of natural language processing; specifically, I'm reviewing parsers that take advantage of data-mining techniques.

I've read an introduction to natural language processing, specifically the maximum entropy approach by Della Pietra and others. In that paper, the authors explain that features are indicator functions that capture certain aspects of the text. The problem is that these aspects affect the maximum entropy function.

In the case of parsing, I have problems understanding the concept of features. For parsing one must have:

  1. Two sets, $X$ for the sentences and $Y$ for the parses
  2. For each sentence $x \in X$, an N-best list of candidate parses, denoted $\mathrm{GEN}(x) \subset Y$.
  3. Training samples $(x_j,y_j)$. The $x_j$ are sentences and the $y_j$ are parses.
  4. A feature mapping $\Phi$ that maps each pair $(x, y)$ to a vector of feature values.
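To make the feature mapping concrete, here is a minimal sketch of what a $\Phi(x, y)$ might look like. Everything here is an illustrative assumption: the parse is represented as a list of labelled rules, and the particular indicators are made up, not taken from any specific parser.

```python
def phi(sentence, parse):
    """Map a (sentence, parse) pair to a vector of feature values.

    Hypothetical representation: `parse` is a list of (head, children)
    rule tuples, e.g. ("S", ("NP", "VP")). The features below are
    illustrative only.
    """
    return [
        # Indicator: does the parse use the rule S -> NP VP?
        1.0 if ("S", ("NP", "VP")) in parse else 0.0,
        # Indicator: is the root of the parse an S node?
        1.0 if parse and parse[0][0] == "S" else 0.0,
        # Real-valued: number of rules in the parse (tree size)
        float(len(parse)),
        # Real-valued: sentence length in tokens
        float(len(sentence.split())),
    ]

phi("the dog barks", [("S", ("NP", "VP")), ("NP", ("DT", "NN"))])
```

Indicator features like the first two are exactly the kind of binary functions Della Pietra et al. describe; real-valued features such as counts are also common in practice.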

The parsing problem then consists of finding the weight vector $w$ that minimizes the following expression:

$$
-\sum_j \log P(y_j \mid x_j, w) + \lambda \|w\|_1
$$

where

$$
P(y_j \mid x_j, w) = \frac{\exp\{w^T \Phi(x_j, y_j)\}}{\sum_{u \in \mathrm{GEN}(x_j)} \exp\{w^T \Phi(x_j, u)\}}
$$
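For reference, the two expressions above can be written out numerically. This is a minimal stdlib sketch under my own assumptions: the feature vectors $\Phi(x_j, y_j)$ and $\{\Phi(x_j, u)\}_{u \in \mathrm{GEN}(x_j)}$ are assumed to be precomputed lists of floats, and the function names are mine.

```python
import math

def log_prob(w, phi_xy, phi_candidates):
    """log P(y_j | x_j, w): log-linear score of the gold parse minus the
    log normalizer over the candidate parses in GEN(x_j). Assumes
    phi_xy is itself one of phi_candidates."""
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    scores = [dot(w, p) for p in phi_candidates]
    m = max(scores)  # log-sum-exp trick for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return dot(w, phi_xy) - log_z

def objective(w, data, lam):
    """Negative log-likelihood plus the L1 penalty:
    -sum_j log P(y_j | x_j, w) + lam * ||w||_1."""
    nll = -sum(log_prob(w, phi_xy, cands) for phi_xy, cands in data)
    return nll + lam * sum(abs(wi) for wi in w)
```

With $w = 0$ and two candidates, every parse gets probability $1/2$, so the per-example loss is $\log 2$; the L1 term then pushes weights toward sparse solutions as they grow.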

My question is: what are the features in this case? Some encoded version of the parse tree? Are features a statistical concept?

I would also appreciate any references that could be useful in tackling this issue.

UPDATE: I've talked to a professor, and he told me that the features are obtained by running a classification algorithm on the parse tree. Do you know what kind of classification is done with the parse tree?

Best Answer

The term "feature" is used quite generally in machine learning. Given a raw data matrix $X$ composed of vectors $x_i$, with outputs $y_i$, you transform each $x_i$ into some $x_i'$. The point is to obtain feature values that are as informative as possible about the output; for example, you would want them to be correlated with it in some manner, or at least to place like outputs near like inputs in Euclidean space. You then train your model (linear regression, a CART tree, whatever) on the new $x_i'$.

In the context of NLP, given a sentence as your raw input data, perhaps (for example) your feature function maps each $x_i$ to a new vector containing counts of nouns, verbs, and adjectives, respectively. You might take this approach if you thought the different parts of speech would significantly impact the output variable. In your case, it sounds like the data gets parsed, and the parsed data is then somehow converted by the feature function into a real-valued vector suitable for running machine learning algorithms on. If your professor said the vector was obtained by means of a classification algorithm, that simply means a classification algorithm (of any type, be it SVM, LDA, Naive Bayes, etc.) was used to create the features. It seems he's doing some sort of nested learning. Perhaps he's classifying whether a particular parsed sentence has feature Q, and then putting the certainty of having feature Q into the feature vector, for example.
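The part-of-speech counting idea above can be sketched as follows. The tiny tag lexicon here is made up purely for illustration; a real system would use an actual POS tagger (e.g. from NLTK or spaCy) rather than a hand-written dictionary.

```python
# Toy lexicon mapping words to coarse POS tags: N(oun), V(erb), A(djective).
# Purely illustrative -- a real feature function would call a POS tagger.
POS = {"dog": "N", "cat": "N", "runs": "V", "barks": "V",
       "big": "A", "small": "A"}

def pos_counts(sentence):
    """Map a sentence to [noun count, verb count, adjective count]."""
    counts = {"N": 0, "V": 0, "A": 0}
    for token in sentence.lower().split():
        tag = POS.get(token)
        if tag in counts:
            counts[tag] += 1
    return [counts["N"], counts["V"], counts["A"]]

pos_counts("The big dog barks")
```

The resulting three-dimensional vector is exactly the kind of real-valued $\Phi$ output the log-linear model above consumes.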
