The fact of the matter is that most of your explanatory variables are very narrow and don't provide much information, you have too many of them, and your target variable is very narrow as well, with extremely unbalanced classes. Blind, naive logistic regression simply might not be the best solution.
The problem with the probabilities from your fitted model is simply that the glm training failed. I repeated your code for i=1
and j=1
-- the results start like this:
Deviance Residuals:
Min 1Q Median 3Q Max
-2.409e-06 2.409e-06 2.409e-06 2.409e-06 2.409e-06
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.174e+01 4.290e+05 0 1
MF_tense_vbn -1.033e-06 1.887e+05 0 1
MF_tense_vbd -1.985e-07 1.436e+05 0 1
MF_tense_vbg -1.875e-06 1.199e+05 0 1
MF_tense_vbp -6.888e-08 2.024e+05 0 1
...
None of the parameters in the model are significant at all, but of course you shouldn't trust anything here because the Newton-Raphson algorithm hasn't been able to converge. Why exactly this is, I don't know, but it could have something to do with the large number of very narrow variables. Perhaps you should start by eliminating all the "poor" variables by some criterion (e.g. >99% or >95% 0, or 1 as the case may be) and trying the regression again.
Of course, if you are not constrained to consider only logistic regression, there are other options. As a quick example, I tried training a boosted decision tree using the gbm
method from the caret
package. Applying this blindly, the resulting object almost always predicts NOT.1
, since this is so much more likely a priori. So, I weighted the training dataset to have equal numbers of 1
and NOT.1
entries, and trained the BDT using
gbm <- train(tp ~.-id-weight, data=tmp, method="gbm", weights = tmp$weight)
The resulting tree clearly is somewhat successful. On the training dataset, it correctly identifies almost all 1
entries, though it has a number of false positives:
> table(tmp$tp,predict(gbm,newdata=tmp))
1 NOT.1
1 18 1
NOT.1 14 200
The test dataset is small, unfortunately (too small). The model identifies one 1
entry correctly, misses a second, and has one false positive:
> table(test.data.tmp$tp,predict(gbm,newdata=test.data.tmp))
1 NOT.1
1 1 1
NOT.1 1 31
Perhaps more interesting are the most important variables in the BDT:
> varImp(gbm)
gbm variable importance
only 20 most important variables shown (out of 89)
Overall
STA.GEN_any_obj 100.00
MF_tense_vbd 85.38
MF_1n_nominal 83.65
SEM.o.L 80.08
SEM.o.H 76.91
STA.GEN_dobj 69.08
STA.GEN_advmod 67.33
MF_tense_vb 61.34
MF_3p_nominal 59.01
SEM.o.ac 55.19
MF_3n_adverbial 51.39
STA.GEN_n_subj 43.44
MF_2n_adverbial 40.14
MF_2n_nominal 40.08
STA.GEN_obj_pl 39.19
STA.LEX_prep_none 32.12
MF_3p_adverbial 30.99
MF_1p_adverbial 28.64
MF_tense_vbg 25.63
SEM.s.H 23.78
Performing logistic regression with only the five most "important" variables has a more successful result than the blind, overall logistic regression:
Call:
glm(formula = tmp$tp ~ STA.GEN_any_obj + MF_tense_vbd + MF_1n_nominal +
SEM.o.L + SEM.o.H, family = binomial(link = "logit"), data = tmp)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6514 0.1197 0.2458 0.5413 1.0422
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.93547 0.89712 5.501 3.77e-08 ***
STA.GEN_any_obj -1.45063 0.83801 -1.731 0.08345 .
MF_tense_vbd -1.52018 0.54963 -2.766 0.00568 **
MF_1n_nominal -1.63812 0.82791 -1.979 0.04786 *
SEM.o.L 2.26693 1.07096 2.117 0.03428 *
SEM.o.H -0.03051 0.72658 -0.042 0.96650
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 131.66 on 232 degrees of freedom
Residual deviance: 103.16 on 227 degrees of freedom
AIC: 115.16
Number of Fisher Scoring iterations: 6
Of course, I intend this only for demonstration purposes -- I picked the top 5 arbitrarily, and ignored interaction effects that are likely considered in the BDT.
So, I recommend you explore some alternative to logistic regression or find some way of choosing a subset of explanatory variables. The latter could be done as I've shown here, but a much better method would be to use the lasso regression, using the glmnet
package in R. I would recommend on studying ridge regression and the lasso, if you are unfamiliar with these concepts.
EDIT
I also wouldn't give up on using multinomial regression instead of one versus all (I don't know if your choice to use one-versus-all binomial regression is by design or by necessity). For my own curiosity, since I hadn't tried performing multinomial regression before, I used your entire dataset (all training folds combined and all testing folds combined) to train a lasso multinomial regression using the cv.glmnet
tool to pick the best model by automatic cross validation. The resulting model is 100% accurate on the training dataset and accurate on 265 out of 267 test cases, which is pretty good, I think. If you're interested in using the glmnet
tool you should be able to reproduce these results using (beware minor typos)
## Combine folds, impute zeros (I think that's what "trimmed" took away)
require(plyr)
train.data <- NULL
test.data <- NULL
for (j in 1:8) {train.data <- rbind.fill(train.data,training.folds.trimmed[[j]])}
for (j in 1:8) {test.data <- rbind.fill(test.data,test.folds.trimmed[[j]])}
train.data[is.na(train.data)] <- 0
test.data[is.na(test.data)] <- 0
## Get datasets
y.train <- train.data$tp
y.test <- test.data$tp
X.train <- train.data
X.train$tp <- NULL
X.train$id <- NULL
X.train <- as.matrix(X.train)
X.test <- test.data
X.test$tp <- NULL
X.test$id <- NULL
X.test <- as.matrix(X.test)
## Fit
require(glmnet)
cv.fit <- cv.glmnet(X.train,y.train,family="multinomial")
y.train.predict <- predict(cv.fit,X.train,type="class")
y.test.predict <- predict(cv.fit,X.test,type="class")
table(y.train,y.train.predict)
table(y.test,y.test.predict)
Best Answer
R is giving you two different warnings because these really are two distinct issues.
Very loosely, the algorithm that fits a logistic regression model (typically some version of Newton-Raphson) looks around for the coefficient estimates that will maximize the log likelihood. It will estimate the model at a given point in the parameter space, see which direction is 'uphill', and then move some distance in that direction. The potential problem with this is that when perfect separation exists, the maximum of the log likelihood is where the slope is infinite. Because a search algorithm has to be designed to stop at some point, it doesn't converge.
On the other hand, no matter where it stops, whether it converged or not, it is (theoretically) possible to calculate the model's predicted values for the data. However, because computers use finite precision arithmetic, when they perform the calculations, they eventually need to round off or drop extremely low decimal values. Thus, if the arithmetically correct value is sufficiently close to 0 or 1, when it is rounded, it can end up being 0 or 1 exactly. Values can have that property and be within the normal range of the data due to an extremely large (in absolute value) slope estimate due to complete separation, or they can just be so far out on X that even a small slope will lead to the same phenomenon.
Here we see that we got the second warning, but the algorithm converged. The betas are reasonably close to the true values, the standard errors aren't huge, and the number of Fisher scoring iterations is moderate. Nonetheless, the extreme x-values yield predicted log odds that are perfectly calculable, but when converted into probabilities, become essentially 0 and 1.