Solved – Regression analysis in R using text field

natural language, r, regression

I'm working in R. I'd like to run a regression analysis for predicting price against terms in a text field.

I have a dataset of jewellery auction listings, with price paid, date, and an unstructured description of the item type:

text,date,price_usd
"Ruby necklace, Spanish",1925,45000
"Diamond ring, 0.7 carat, bezier cut",1972,24000
"Diamond necklace",1980,87000
...

I know how to run a linear regression for price against date:

data <- read.csv('jewels.csv')
lm1 <- lm(price_usd ~ date, data = data)
summary(lm1)

Now what I'd like to do is build a similar model, using the words in the description field that are most associated with higher prices.

Intuitively I'd guess these include "diamond" and "necklace", while (say) "amethyst" and "ring" are associated with lower prices, but is there a way I can build a model to look at this?

My sense is that I need to do the following things:

  • turn the text field into a bag of words (vector)
  • remove stop words
  • normalize each word for overall count(?)
  • run some kind of regression against price.

I'd really welcome some guidance on how to approach each step.

Best Answer

First, I'd split each text description into words. There are several ways to do it; the simplest is to use strsplit with a suitable split argument.

What you get is a list of character vectors, one per description, each holding that description's words. Note: a badly chosen split argument leaves you with lots of garbage tokens, which is not necessarily a problem, since you can filter some of them out later.

descriptions = as.character(data$text)
all.words = strsplit(descriptions, "[ ,]+")  # regex: split on runs of spaces and commas

Next, I'd combine these into a single vector of words and count how often each one occurs:

words = unlist(all.words)
word.count = table(words)

Then I'd keep only the words that appear often enough (more than 3 times in this example):

chosen.words = names(word.count)[word.count>3]

Now, for each chosen word and each row of your data, I'd add an indicator variable that records whether the word appears in that item's description.
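One way to build those indicator variables is sketched below. It uses only the three example rows from the question, lower-cases the text so that "Diamond" and "diamond" count as the same word, and hard-codes a hypothetical chosen.words vector because three rows are too few for a frequency filter. Treat it as an illustration under those assumptions rather than the only way to do this.

```r
# The three example rows from the question
data <- data.frame(
  text = c("Ruby necklace, Spanish",
           "Diamond ring, 0.7 carat, bezier cut",
           "Diamond necklace"),
  date = c(1925, 1972, 1980),
  price_usd = c(45000, 24000, 87000),
  stringsAsFactors = FALSE
)

# Split into lower-case words on runs of spaces and commas
all.words <- strsplit(tolower(data$text), "[ ,]+")

# Hypothetical word list for illustration; with real data this would be
# the frequency-filtered chosen.words
chosen.words <- c("diamond", "necklace", "ring", "ruby")

# One row per listing, one column per word: 1 if the word occurs, else 0
indicators <- sapply(chosen.words, function(w) {
  sapply(all.words, function(ws) as.integer(w %in% ws))
})

# Attach the indicator columns to the original data
data.words <- cbind(data, indicators)
print(data.words)
```

Each chosen word becomes a 0/1 column next to the original fields, so it can be used in a model formula like any other numeric predictor.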

With this new data you have one variable per word. Add these variables to your regression, and each coefficient estimates how much the presence of that word shifts the price, holding the other words fixed.
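The regression step might look like the sketch below. The toy data frame is made up purely for illustration (the prices beyond the question's three rows and all indicator values are assumptions), and reformulate() from the stats package is used to build the formula from the word list so the terms need not be typed by hand.

```r
# Toy data: 0/1 indicator columns of the kind described above.
# All values here are made up for illustration, not real auction results.
toy <- data.frame(
  price_usd = c(45000, 24000, 87000, 12000, 60000, 9000, 30000, 52000),
  diamond   = c(0, 1, 1, 0, 1, 0, 0, 1),
  necklace  = c(1, 0, 1, 0, 1, 0, 1, 0),
  ruby      = c(1, 0, 0, 1, 0, 0, 1, 0)
)

chosen.words <- c("diamond", "necklace", "ruby")

# Builds the formula price_usd ~ diamond + necklace + ruby
f <- reformulate(chosen.words, response = "price_usd")

# Ordinary least squares fit; one coefficient per word plus an intercept
lm2 <- lm(f, data = toy)
summary(lm2)
```

With a real dataset the number of word columns can grow quickly relative to the number of listings, so it may be worth keeping the frequency threshold fairly strict. Words that are not valid R names (for example "0.7") may need backticks or renaming before they can be used in a formula.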

HTH
