Solved – Using Linear Regression on text data

Tags: data visualization, python, regression, scatterplot, scikit-learn

I am trying to create a model that predicts an author's age.
I'm using Nguyen et al. (2011) as my basis.

Using a bag-of-words model, I count the occurrences of each word per document (the documents are posts from message boards) and build a feature vector from those counts. I am using scikit-learn.

I limit the size of the vector by using only the top-k (k=number)
most frequently used words as features (stop words are excluded).

The vectors are then scaled:

from sklearn import preprocessing
X_train = preprocessing.scale(X_train)

I fit a linear regression model on the training data (I also tried Lasso):

from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(X_train, y_train)

When I test the model on my test data, I get a low r² score (0.01-0.15)
but an acceptable MAE (compared with the paper).
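One way to put both numbers in context is to compare them with a baseline that always predicts the mean training age. The sketch below uses synthetic data (Poisson word counts and a weak, made-up relationship to age), so the numbers it prints say nothing about the real dataset; it only shows how the two metrics and the baseline can be computed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 "posts", 20 integer word-count features,
# with only a weak signal in the first feature (all values are assumptions).
X = rng.poisson(lam=1.0, size=(1000, 20)).astype(float)
age = 30 + 1.0 * X[:, 0] + rng.normal(scale=8.0, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, age, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

# Baseline: ignore the features and always predict the mean training age.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_r2 = r2_score(y_test, baseline_pred)
baseline_mae = mean_absolute_error(y_test, baseline_pred)

print(f"model:    r2={r2:.3f}  MAE={mae:.2f}")
print(f"baseline: r2={baseline_r2:.3f}  MAE={baseline_mae:.2f}")
```

A constant prediction can never score above 0 on r², so an r² near 0 means the model is doing little better than this baseline, even when the MAE looks reasonable in absolute terms.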

When I run the plot function from scikit-learn's
example, I get this:

[Plot]

As in the example, I plot only the first feature of the dataset.

What can I do to improve the r² score, and what did I do wrong to make the plot look like this?

Best Answer

The plot doesn't look wrong. Your X axis is the word count of one word, after scaling. The Y axis is age. The vertical stacks result from always having an integer word count; there are 8 stacks corresponding to word counts of 0-7. The blue trend line shows that this word is a weak positive indicator for age.
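The stacking effect is easy to reproduce. The sketch below builds a made-up column of integer word counts from 0 to 7 and scales it; the data is purely illustrative, but it shows that scaling only shifts and stretches the values, so the number of distinct x positions stays the same.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical single feature: integer word counts 0..7, 50 posts per count.
counts = np.tile(np.arange(8.0), 50).reshape(-1, 1)

# Same scaling call as in the question: zero mean, unit variance per column.
scaled = preprocessing.scale(counts)

n_raw = len(np.unique(counts))     # distinct raw word counts
n_scaled = len(np.unique(scaled))  # distinct x positions after scaling

print(n_raw, n_scaled)
print(np.unique(scaled))  # the x coordinates of the vertical stacks
```

Each distinct scaled value is the x coordinate of one vertical stack in the scatter plot.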

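The plot would be slightly clearer if you did not scale your input, because the X axis would then show the raw word counts. Ordinary least-squares linear regression doesn't benefit from unit-variance scaling anyway: rescaling the features rescales the coefficients but leaves the predictions unchanged. (Regularized models such as Lasso are different, since the penalty depends on the scale of the features.)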
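A quick check of the scaling claim, as a sketch on synthetic data (the feature matrix and target here are made up): fitting ordinary least squares on raw and on standardized features should give the same predictions, up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic word counts and ages; all values are illustrative assumptions.
X = rng.poisson(lam=2.0, size=(200, 5)).astype(float)
y = 25 + 2.0 * X[:, 0] + rng.normal(scale=5.0, size=200)

# Standardize each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

pred_raw = LinearRegression().fit(X, y).predict(X)
pred_scaled = LinearRegression().fit(X_scaled, y).predict(X_scaled)

# The coefficients differ between the two fits, but the predictions agree.
same = np.allclose(pred_raw, pred_scaled)
print(same)
```

The same comparison with `Lasso` in place of `LinearRegression` would generally not give matching predictions, because the L1 penalty is applied to coefficients whose size depends on the feature scale.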
