Solved – Dealing with Sparse Matrices and multiple numerical features when training algorithm

machine-learning, natural-language, sparse

I have a data frame that looks as follows:

                                    description      priority  CDT  JDT  
0  Create Help Index Fails with seemingly incorre...       P3    0    0       
1  Internal compiler error when compiling switch ...       P3    0    1       
2  Default text sizes in org.eclipse.jface.resour...       P3    0    0       
3  [Presentations] [ViewMgmt] Holding mouse down ...       P3    0    0       
4  Parsing of function declarations in stdio.h is...       P2    1    0       

PDE  Platform  Web Tools  priorityLevel  
0         0          0              2  
1         0          0              2  
2         1          0              2  
3         1          0              2  
4         0          0              1  

I am currently trying to train an ML algorithm that would take the text in 'description' along with the rest of the numerical features except for 'priority' (discarded) and 'priorityLevel' (true labels).

This is basically an NLP application. The issue I'm having is that 'description' must first go through a 'CountVectorizer()' function:

from sklearn.feature_extraction.text import CountVectorizer

X = df['description']
cv = CountVectorizer()
X = cv.fit_transform(X)  # returns a sparse matrix of token counts

The sparse matrix this returns is incompatible with the rest of the data frame when I go to pass it to the training algorithm.

I need to be able to combine X after it has been vectorized, along with df[['CDT', 'JDT', 'PDE', 'Platform', 'Web Tools']] into a single variable in order to split and train:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

nb = MultinomialNB()
nb.fit(X_train, y_train)

In essence, X should contain the vectorized text, along with the numerical variables. All efforts thus far have failed.

I have tried to do this through a pipeline as well:

pipeline = Pipeline([
    ('bow', CountVectorizer()),  # strings to token-count vectors
    ('classifier', MultinomialNB()),
])

pipeline.fit(X_train,y_train)

But I get errors indicating that the input sizes are incompatible.

Does anyone know of a way to combine the sparse matrix returned by the vectorizer with the numerical features so that I can train the algorithm?

All help is appreciated.

Edit:

I have trained this algorithm with no problems whatsoever using only the vectorized text. My issue arises when trying to incorporate additional features into the training set.

Best Answer

From what I understand, you are using some sort of term-frequency matrix plus additional features.

That means each example $x$ is represented by a term-frequency vector $x_{TF}$ and a feature vector $x_F$, where the $f$-th entry of $x_F$ is the value of feature $f$.

What you can do is represent each $x$ by the concatenation of the two encodings. As long as you don't have too many features, this is feasible.

Since sklearn's CountVectorizer returns sparse matrices (in CSR format, from what I recall), you just need to convert the $X_F$ matrix to a sparse matrix (using scipy.sparse.csr_matrix) and then concatenate it with $X_{TF}$ (using scipy.sparse.hstack).
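A minimal sketch of this, using a toy data frame with the same column names as in the question (the descriptions are shortened placeholders, not your real data):

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack, issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the data frame shown in the question.
df = pd.DataFrame({
    'description': ['create help index fails',
                    'internal compiler error when compiling switch',
                    'default text sizes in jface',
                    'holding mouse down on view',
                    'parsing of function declarations in stdio'],
    'CDT': [0, 0, 0, 0, 1], 'JDT': [0, 1, 0, 0, 0],
    'PDE': [0, 0, 0, 0, 0], 'Platform': [0, 0, 1, 1, 0],
    'Web Tools': [0, 0, 0, 0, 0],
    'priorityLevel': [2, 2, 2, 2, 1],
})

# Vectorize the text: X_tf is a sparse CSR matrix, shape (n_samples, vocab_size).
cv = CountVectorizer()
X_tf = cv.fit_transform(df['description'])

# Convert the numeric columns to a sparse matrix and concatenate column-wise.
X_num = csr_matrix(df[['CDT', 'JDT', 'PDE', 'Platform', 'Web Tools']].values)
X = hstack([X_tf, X_num]).tocsr()  # shape (n_samples, vocab_size + 5)
y = df['priorityLevel']

# The combined matrix drops straight into the original training code.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=101)
nb = MultinomialNB()
nb.fit(X_train, y_train)
```

Note that hstack returns a COO matrix by default, hence the `.tocsr()` call; MultinomialNB accepts either, but CSR supports the row slicing that train_test_split performs. If you later want this inside a Pipeline, a ColumnTransformer that applies CountVectorizer to 'description' and passes the numeric columns through would achieve the same concatenation.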
