Solved – One-vs-All Logistic Regression Probability Scores either close to 0 or 1


I am trying to implement one-vs-all logistic regression myself on a small dataset that I have already divided into 8 folds (I do not want to use nnet::multinom). I have 14 classes and about 90 features, and I have created 14 different logistic regression models, one per class. The probabilities I get in the end are all either very close to 0 or very close to 1, which makes it impossible to choose a 'threshold' for classification.

For example, in an ordinary binary logistic regression task, where the vector you are trying to predict contains only 0s and 1s, after calling predict on your test set you can write something like this:

fit.res <- ifelse(fit.res > 0.5, 1, 0)

Here you choose between 0 and 1 with a threshold of 0.5. In my task, however, I have 14 different classifiers, where one classifier gives the probability of, for example, "1" versus "NOT.1", another the probability of "4" versus "NOT.4", and so on. Naturally I would use the code above to set a threshold and, based on the returned probabilities, choose between the two classes for each classifier.

However, the probabilities I am getting are all either something like 0.99999999 or 0.0000002, so no threshold for classification can be chosen. Am I doing something wrong?

In the end, my misclassification error and accuracy are really bad.

Training dataset: here
Test dataset: here

## One prediction data frame per fold
predictions <- as.data.frame(NULL)
prediction.for.fold <- list(predictions, predictions, predictions, predictions,
                            predictions, predictions, predictions, predictions)

for (j in 1:8) {
  for (i in 1:14) {
    ## Current class for this one-vs-all classifier
    class <- as.character(unique(training.folds.trimmed[[j]]$tp)[i])

    ## Relabel the training fold as class vs. NOT.class
    tmp <- training.folds.trimmed[[j]]
    tmp$tp <- as.character(tmp$tp)
    tmp$tp[tmp$tp != class] <- paste0("NOT.", class)

    ## Relabel the test fold the same way
    test.data.tmp <- test.folds.trimmed[[j]]
    test.data.tmp$tp <- as.character(test.data.tmp$tp)
    test.data.tmp$tp[test.data.tmp$tp != class] <- paste0("NOT.", class)

    tmp$tp <- as.factor(tmp$tp)
    test.data.tmp$tp <- as.factor(test.data.tmp$tp)

    ## Fit the binary logistic regression and predict on the test fold
    log.reg <- glm(tmp$tp ~ . - id, data = tmp, family = binomial(link = "logit"))
    prediction <- predict.glm(log.reg, newdata = test.data.tmp, type = "response")
    prediction.for.fold[[j]] <- rbind(prediction.for.fold[[j]], prediction)
    rownames(prediction.for.fold[[j]])[i] <- class
  }
  prediction.for.fold[[j]] <- t(prediction.for.fold[[j]])
}

In the end I also get a lot of warnings:

glm.fit: algorithm did not converge

glm.fit: fitted probabilities numerically 0 or 1 occurred

Probabilities of NOT being in a certain class, as returned by the classifiers for the 1st fold (column names correspond to class names):
[screenshot of the fold-1 probability table]

Best Answer

The fact of the matter is that most of your explanatory variables are very narrow (nearly all one value) and don't provide much information, you have too many of them, and your target variable is narrow as well, with extremely unbalanced classes. Blind, naive logistic regression simply might not be the best solution.

The problem with the probabilities from your fitted model is simply that the glm training failed. I repeated your code for i=1 and j=1; the model summary starts like this:

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-2.409e-06   2.409e-06   2.409e-06   2.409e-06   2.409e-06  

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)             3.174e+01  4.290e+05       0        1
MF_tense_vbn           -1.033e-06  1.887e+05       0        1
MF_tense_vbd           -1.985e-07  1.436e+05       0        1
MF_tense_vbg           -1.875e-06  1.199e+05       0        1
MF_tense_vbp           -6.888e-08  2.024e+05       0        1
...

None of the parameters in the model is remotely significant, but of course you shouldn't trust anything here, because the Newton-Raphson algorithm was not able to converge. Why exactly this is, I don't know, but it could have something to do with the large number of very narrow variables. Perhaps you should start by eliminating all the "poor" variables by some criterion (e.g. columns that are more than 95% or 99% zeros, or ones, as the case may be) and trying the regression again.
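As a minimal sketch of such a filter (assuming tmp is the relabelled training fold from your code, that the remaining columns are the features, and that a 95% cutoff is as good as any), you could do something like:

## Keep only features whose most frequent value covers at most 95% of rows,
## i.e. drop columns that are almost entirely 0 or almost entirely 1.
feature.cols <- setdiff(names(tmp), c("id", "tp"))
keep <- sapply(tmp[feature.cols],
               function(x) max(table(x)) / length(x) <= 0.95)
tmp.reduced <- tmp[, c("id", "tp", feature.cols[keep])]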

Of course, if you are not constrained to logistic regression, there are other options. As a quick example, I tried training a boosted decision tree (BDT) with the gbm method from the caret package. Applied blindly, the resulting model almost always predicts NOT.1, since that class is so much more likely a priori. So I weighted the training dataset so that the 1 and NOT.1 entries carry equal total weight, and trained the BDT using

library(caret)

gbm <- train(tp ~ . - id - weight, data = tmp, method = "gbm", weights = tmp$weight)
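The construction of the weight column is not shown above; a minimal sketch of one way to give the two classes equal total weight (my assumption about what was done, to be run before the train() call) is:

## Give the 1 and NOT.1 entries equal total weight, so the boosted
## tree cannot win simply by always predicting the majority class.
n.pos <- sum(tmp$tp == "1")
n.neg <- sum(tmp$tp == "NOT.1")
tmp$weight <- ifelse(tmp$tp == "1", 1 / n.pos, 1 / n.neg)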

The resulting tree is clearly somewhat successful. On the training dataset it correctly identifies almost all of the 1 entries, though it produces a number of false positives:

> table(tmp$tp,predict(gbm,newdata=tmp))

          1 NOT.1
  1      18     1
  NOT.1  14   200

The test dataset is, unfortunately, very small. The model identifies one 1 entry correctly, misses a second, and produces one false positive:

> table(test.data.tmp$tp,predict(gbm,newdata=test.data.tmp))

         1 NOT.1
  1      1     1
  NOT.1  1    31

Perhaps more interesting are the most important variables in the BDT:

> varImp(gbm)
gbm variable importance

  only 20 most important variables shown (out of 89)

                  Overall
STA.GEN_any_obj    100.00
MF_tense_vbd        85.38
MF_1n_nominal       83.65
SEM.o.L             80.08
SEM.o.H             76.91
STA.GEN_dobj        69.08
STA.GEN_advmod      67.33
MF_tense_vb         61.34
MF_3p_nominal       59.01
SEM.o.ac            55.19
MF_3n_adverbial     51.39
STA.GEN_n_subj      43.44
MF_2n_adverbial     40.14
MF_2n_nominal       40.08
STA.GEN_obj_pl      39.19
STA.LEX_prep_none   32.12
MF_3p_adverbial     30.99
MF_1p_adverbial     28.64
MF_tense_vbg        25.63
SEM.s.H             23.78

Performing logistic regression with only the five most "important" variables gives a more successful result than the blind, all-variables logistic regression:

Call:
glm(formula = tmp$tp ~ STA.GEN_any_obj + MF_tense_vbd + MF_1n_nominal + 
    SEM.o.L + SEM.o.H, family = binomial(link = "logit"), data = tmp)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6514   0.1197   0.2458   0.5413   1.0422  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      4.93547    0.89712   5.501 3.77e-08 ***
STA.GEN_any_obj -1.45063    0.83801  -1.731  0.08345 .  
MF_tense_vbd    -1.52018    0.54963  -2.766  0.00568 ** 
MF_1n_nominal   -1.63812    0.82791  -1.979  0.04786 *  
SEM.o.L          2.26693    1.07096   2.117  0.03428 *  
SEM.o.H         -0.03051    0.72658  -0.042  0.96650    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 131.66  on 232  degrees of freedom
Residual deviance: 103.16  on 227  degrees of freedom
AIC: 115.16

Number of Fisher Scoring iterations: 6

Of course, I intend this only as a demonstration: I picked the top five arbitrarily and ignored interaction effects that the BDT likely captures.

So, I recommend that you either explore an alternative to logistic regression or find some way of choosing a subset of explanatory variables. The latter could be done as I've shown here, but a much better approach would be lasso regression, using the glmnet package in R. I would recommend studying ridge regression and the lasso if you are unfamiliar with these concepts.
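For instance, a lasso-penalised version of one of your one-vs-all fits could look roughly like this (a sketch only, assuming the same relabelled tmp and test.data.tmp objects from your code and that all remaining columns are numeric):

require(glmnet)

## Lasso-penalised binomial fit for a single one-vs-all classifier;
## cv.glmnet picks the penalty strength by cross-validation.
X <- as.matrix(tmp[, setdiff(names(tmp), c("id", "tp"))])
X.new <- as.matrix(test.data.tmp[, setdiff(names(test.data.tmp), c("id", "tp"))])
cv.bin <- cv.glmnet(X, tmp$tp, family = "binomial")

## Predicted probabilities of the second factor level (here the NOT.class label)
p.hat <- predict(cv.bin, X.new, type = "response", s = "lambda.min")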

EDIT

I also wouldn't give up on multinomial regression in favour of one-vs-all (I don't know whether your choice of one-vs-all binomial regression is by design or by necessity). Out of my own curiosity, since I hadn't tried multinomial regression before, I used your entire dataset (all training folds combined and all test folds combined) to train a lasso multinomial regression with cv.glmnet, which picks the best model by automatic cross-validation. The resulting model is 100% accurate on the training dataset and correct on 265 out of 267 test cases, which is pretty good, I think. If you're interested in using glmnet, you should be able to reproduce these results with the following (beware minor typos):

## Combine folds, impute zeros (I think that's what "trimmed" took away)
require(plyr)
train.data <- NULL
test.data <- NULL
for (j in 1:8) {train.data <- rbind.fill(train.data,training.folds.trimmed[[j]])}
for (j in 1:8) {test.data <- rbind.fill(test.data,test.folds.trimmed[[j]])}
train.data[is.na(train.data)] <- 0
test.data[is.na(test.data)] <- 0

## Get datasets
y.train <- train.data$tp
y.test <- test.data$tp
X.train <- train.data
X.train$tp <- NULL
X.train$id <- NULL
X.train <- as.matrix(X.train)
X.test <- test.data
X.test$tp <- NULL
X.test$id <- NULL
X.test <- as.matrix(X.test)

## Fit a cross-validated multinomial lasso and compare predictions with the truth
require(glmnet)
cv.fit <- cv.glmnet(X.train, y.train, family = "multinomial")
y.train.predict <- predict(cv.fit, X.train, type = "class")
y.test.predict <- predict(cv.fit, X.test, type = "class")
table(y.train, y.train.predict)
table(y.test, y.test.predict)
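If you also want to see which variables the lasso ends up keeping, which ties back to the variable-selection point above, something along these lines should work (the exact set of course depends on the lambda you pick):

## Nonzero coefficients per class at the cross-validated lambda:
## these are the variables the multinomial lasso actually uses.
coefs <- coef(cv.fit, s = "lambda.min")
lapply(coefs, function(m) rownames(m)[as.vector(m != 0)])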