Solved – Word2Vec and PyTorch – am I approaching this correctly

machine learning, python, text mining, word2vec

My understanding of Word2Vec is that the library generates, for each word, an array of numbers (a vector) that approximates the word's meaning relative to the other words it appears with.

My use of Word2Vec

For example, consider the following sentence:

"Machine learning with Python is very useful."

To this end, I trained a model using Word2Vec as follows:

from gensim.models import Word2Vec

# define training data: one tokenized sentence
sentences = [['machine', 'learning', 'with', 'python', 'is', 'very', 'useful']]

# train model
model = Word2Vec(sentences, min_count=1)

# summarize the loaded model
print(model)

# summarize vocabulary
words = list(model.wv.vocab)  # in gensim 4.x this is list(model.wv.index_to_key)
print(words)

# access vector for one word
print(model.wv['machine'])

# save model
model.save('model.bin')

# load model
new_model = Word2Vec.load('model.bin')
print(new_model)

When I printed the vector for the word 'machine', I obtained an array of numbers:

>>> print(model.wv['machine'])
[-5.3296558e-04 -2.4796894e-03 -3.3167074e-03 -2.1227452e-03
  1.6867702e-03  3.2749411e-03 -2.1588034e-03  4.9430062e-03
  ......
 -4.1352920e-03 -4.3468783e-03  2.4636291e-04 -1.8679388e-03
 -2.5670610e-03 -3.5702281e-03 -3.4511611e-03 -3.5669175e-03]

Training a neural network with PyTorch

I then obtained an array of numbers for the other words in the sentence, i.e. 'learning', 'with', 'python', 'is', 'very', 'useful'.
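
Roughly, I collected the per-word vectors and wrote them to a CSV along these lines (a minimal sketch; the exact column layout of numbers2.csv may differ):

import csv

# write one row per word: each row is that word's embedding vector
tokens = ['machine', 'learning', 'with', 'python', 'is', 'very', 'useful']
with open('numbers2.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for word in tokens:
        writer.writerow(model.wv[word])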

Using these arrays of numbers, I trained a neural network with PyTorch:

import torch
import torch.nn as nn
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

import os
path = '/home/yourdirectory'
os.chdir(path)
os.getcwd()

# Variables

dataset = np.loadtxt('numbers2.csv', delimiter=',')
x = dataset[:, 0:6]
y = dataset[:, 0]
y = np.reshape(y, (-1, 1))

(X_train, X_test, y_train, y_test) = train_test_split(x, y, test_size=0.01)

# pytorch array

xtrain = torch.Tensor(X_train)
print(xtrain.size())
ytrain = torch.Tensor(y_train)
print(ytrain.size())

model = nn.Sequential(nn.Linear(6, 10), nn.ReLU(), nn.Linear(10, 1),
                      nn.Sigmoid())

criterion = torch.nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(500):

    # Forward Propagation
    y_pred = model(xtrain)

    # Compute and print loss
    loss = criterion(y_pred, ytrain)
    print('epoch: ', epoch, ' loss: ', loss.item())

    # Zero the gradients
    optimizer.zero_grad()

    # perform a backward pass (backpropagation)
    loss.backward()

    # Update the parameters
    optimizer.step()

The training loss fell as expected as the number of epochs increased:

[Figure: training loss decreasing over epochs]
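
The held-out split from train_test_split can be scored with the same criterion (a small sketch reusing the model and loss defined above):

# evaluate on the held-out test split using the same loss
xtest = torch.Tensor(X_test)
ytest = torch.Tensor(y_test)
with torch.no_grad():
    test_pred = model(xtest)
    print('test loss:', criterion(test_pred, ytest).item())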

Essentially, I am trying to use PyTorch to train a text classification model with deep learning and thus obtain higher accuracy. Is my approach here correct, or have I missed the mark completely?

Best Answer

In general it seems like you're on the right track.

Things you should clarify to help you proceed:

  1. It seems like you are extracting the first 6 dimensions of the full word2vec embedding (100 dimensions with gensim's defaults). This is not advisable. What is in numbers2.csv? Note that word2vec embeds words, not sentences. If you want a model that embeds whole sentences, you could look at InferSent or the Universal Sentence Encoder.
  2. How do you plan on handling multiple words? With text, the typical approach is to use recurrent models (see https://pytorch.org/docs/stable/nn.html#recurrent-layers and the sketch after this list).
  3. What are you trying to predict? This may change things.
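
On point 2, a recurrent model would consume the whole sequence of word vectors rather than a fixed 6-number slice. A minimal sketch, assuming batches of sentences represented as sequences of 100-dimensional word vectors and a single binary label per sentence:

import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    def __init__(self, embedding_dim=100, hidden_dim=50):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x: (batch, seq_len, embedding_dim) - the word2vec vectors for each word
        _, (h_n, _) = self.lstm(x)                # h_n: (1, batch, hidden_dim)
        return torch.sigmoid(self.out(h_n[-1]))   # (batch, 1)

# example: a batch of 4 sentences, each 7 words long, with 100-dim vectors
clf = SentenceClassifier()
dummy = torch.randn(4, 7, 100)
print(clf(dummy).shape)   # torch.Size([4, 1])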

I think you should couple the two parts of the code: your embedding step should be able to take in new raw text, tokenize it, and embed it correctly. You can load these embeddings from a pre-trained source (say, with gensim, which you are already using).
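
For example, loading pre-trained vectors and embedding fresh text could look like this (a minimal sketch using gensim's downloader API; 'glove-wiki-gigaword-100' is just one of the bundled pre-trained models):

import gensim.downloader as api
from gensim.utils import simple_preprocess

# load pre-trained word vectors (downloads on first use)
vectors = api.load('glove-wiki-gigaword-100')

def embed(text):
    # tokenize raw text, then look up a vector for each known token
    tokens = simple_preprocess(text)
    return [vectors[t] for t in tokens if t in vectors]

print(len(embed('Machine learning with Python is very useful')))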
