R Regression – Comparing Multinomial Logistic Regression in R vs Python

generalized-linear-model, python, r, regression

Does anybody have experience with scikit-learn's multinomial logistic regression (model = linear_model.LogisticRegression())?

My data looks like this:

Month   Year     Y
6       1990     Category1
6       1990     Category1
6       1990     Category3
...     ...       ...
10      1993     Category2
10      1993     Category3

Each row is essentially a day, but I'm not using day as a predictor so it looks like a lot of repeats. Basically I'm trying to train a model to give probabilities of being in categories 1-3 given a month and a year. Note that months that are not June-October are filtered out.

In R, I was able to do it (at least I think it is working correctly).

library(nnet)
model <- multinom(Y ~ I(Year - 1990) + as.factor(Month), data = traindata)
predict(model, newdata = data.frame(Year = 2003, Month = 9), "probs")

Probabilities:
Category 1: ~ 0.6 , Category 2: ~ 0.2, Category 3: ~ 0.2

These are the predicted probabilities for some day given year 2003 and month September for being in each of the three categories.

This part is crucial: If I change the prediction from month 9 to month 8, the probability of being in category 3 increases. If I change it to month 7, it decreases again. This is what I expect.

However, when I try to implement this in Python, each category either has the highest or lowest probability at month 6, and it strictly descends or ascends as you predict future months, with no "spike" in probability for category 3 at month 8. I don't think this is correct, and I trust the R output.

Here is my Python implementation:

import pandas as pd
from sklearn import linear_model

model = linear_model.LogisticRegression()
model.fit(xtrain, ytrain)
(Here xtrain is the first two columns of the above data frame with 1990 subtracted from the Year column, and ytrain is the third column.)

Using model.predict_proba(xtest) for, e.g., year=2003 and month=9 gives probabilities that differ from what R gives, and changing the month produces the monotonic trend I described above. I also tried MNLogit from the Python library statsmodels and got a similar result. This makes me think I'm not presenting the data the same way as in R, but I subtracted the base year so that should not be a problem, and converting Month to a category in Python makes no difference to the output.

Does anyone know what might be happening? I would really appreciate any help.

Best Answer

If you are not sure whether a variable is being treated as categorical, you can manually one-hot encode (i.e., dummy-code) its levels so that it is definitely used as categorical. Then rerun the model and see whether the results change. If they do, the variable was not being treated as categorical / as a factor before.
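As a minimal sketch of that idea with pandas (the column names Month, Year, and Y, and the toy values, are taken from the question's data frame; the rest is illustrative), one-hot encoding Month mirrors the as.factor(Month) term in the R formula:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training frame shaped like the question's data
traindata = pd.DataFrame({
    "Month": [6, 6, 6, 7, 8, 9, 10, 10],
    "Year":  [1990, 1990, 1991, 1991, 1992, 1992, 1993, 1993],
    "Y":     ["Category1", "Category1", "Category3", "Category2",
              "Category3", "Category1", "Category2", "Category3"],
})

# One-hot encode Month so it is treated as categorical; drop_first=True
# drops the baseline level, mirroring R's treatment contrasts
X = pd.get_dummies(traindata["Month"].astype("category"),
                   prefix="Month", drop_first=True)
X["YearSince1990"] = traindata["Year"] - 1990

model = LogisticRegression()
model.fit(X, traindata["Y"])

# Predict for September 2003: set the Month_9 dummy, zeros elsewhere
xtest = pd.DataFrame(0, index=[0], columns=X.columns)
xtest["Month_9"] = 1
xtest["YearSince1990"] = 2003 - 1990
probs = model.predict_proba(xtest)
```

With Month encoded this way, each month gets its own coefficient per class, so the predicted probabilities are free to spike at month 8 instead of trending monotonically.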

Another idea (though I suspect that's not it, because it would not produce exactly the behavior you described) is that there could be penalization going on: e.g. for 0 vs. 1 logistic regression, scikit-learn surprisingly defaults to L2 penalization (i.e., ridge regression).
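If penalization is the culprit, you can make it negligible by setting C very large (C is the inverse regularization strength; this works across scikit-learn versions, and recent releases also accept penalty=None outright). A small sketch with made-up (year-1990, month) rows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy design matrix: columns are (year - 1990, month); labels are made up
X = np.array([[0, 6], [0, 7], [1, 8], [1, 9], [2, 10], [2, 6]])
y = ["Category1", "Category2", "Category3",
     "Category1", "Category2", "Category3"]

# A huge C shrinks the default L2 penalty toward zero, approximating the
# unpenalized maximum-likelihood fit that R's nnet::multinom produces
unpenalized = LogisticRegression(C=1e9, max_iter=2000)
unpenalized.fit(X, y)

probs = unpenalized.predict_proba([[13, 9]])  # year 2003, month 9
```

If the probabilities from the unpenalized fit move noticeably closer to R's output, penalization was at least part of the discrepancy.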