I am trying to create a model that predicts an author's age.
I'm using (Nguyen et al, 2011) as my basis.
Using a Bag of Words Model I count the occurrences of words per Document (which are Posts from Boards) and create the vector. I am using scikit-learn.
I limit the size of the vector by using as features the top-k (k=number)
most frequent used words (stopwords will not be used)
The vectors will be scaled.
X_train = preprocessing.scale(X_train)
I train the data on a Linear Regression Model (also tried Lasso)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
When I test the model on my test data I get a low r² score(0.01-0.15)
but an acceptable MAE score (compared with the paper).
When I run the plot function from scikit-learn's
Example, I get this:
Like in the example, I use the first Feature of the Dataset.
What can I do to improve the r² score and what did I do wrong that the plot looks like this?
Best Answer
The plot doesn't look wrong. Your X axis is the word count of one word, after scaling. The Y axis is age. The vertical stacks result from always having an integer word count; there are 8 stacks corresponding to word counts of 0-7. The blue trend line shows that this word is a weak positive indicator for age.
The plot would be slightly clearer if you did not scale your input. Linear regression doesn't benefit from unit-variance scaling anyway.