Machine Learning – How to Perform a Logistic Regression with SMOTE

logistic, machine-learning, r, smote

I want to understand which variables drive parasite infection in trees. Hence, I want to use stepwise logistic regression based on AIC. First I describe my approach; my code follows below.

Dependent variable: infection (Y/N, coded 1/0). The classes are imbalanced at approximately 1:10,000.

Independent variables: stand age, leaf area index (LAI), average temperature, soil moisture, precipitation, species (1/2/3/4), and carbon and nitrogen content (C% and N%).

  infected   age   LAI  avg_temp  avg_moist  precip  species     C%    N%
  <fct>    <dbl> <dbl>     <dbl>      <dbl>   <dbl>    <dbl>  <dbl> <dbl>
  0           15  2.46      25.0        6.8     989        4   4.66  0.10
  0           13  1.50      18.3       10.7     631        3  11.12  0.80
  0           21  3.80      10.5       25.8    1207        2  14.73  1.90
  0           56  5.21      24.2        9.2     434        1   3.21  0.12
  0           57  4.31      20.6       10.4     499        1   4.63  0.17
  0            2  0.58      25.3        2.1     801        4   2.58  0.09
  1. I perform a stepwise logistic regression on the full dataset to see which variables predict infection. The accuracy is 99.5 %, but the model only ever predicts NO.
  2. I split the dataset into training and test sets (80:20).
  3. I apply SMOTE to train.data to balance the classes.
  4. I perform a stepwise logistic regression again on the balanced train.data and get different coefficients but also a lower accuracy. The model now predicts YES as well as NO.

My questions are then:

  1. Is this the right approach and order of steps? If not, what should I change?
  2. Which regression model should I trust (step 1 or step 4) and report? In my opinion, step 1 with all the data is the 'correct' one, and its coefficients are the ones to report and interpret.

Code:

library(readr)
library(tidyverse)
library(caret)
library(MASS)   # note: MASS masks dplyr::select()
library(DMwR)   # provides SMOTE()

trees <- read_csv("trees_simple.csv")
View(trees)

# Convert the outcome and the categorical species code to factors up front,
# so that glm() and SMOTE() treat them correctly
trees$infected <- as.factor(trees$infected)
trees$species  <- as.factor(trees$species)

# Full model on the raw data; non-syntactic names like C% and N% need backticks
full.model <- glm(infected ~ age + LAI + avg_temp + avg_moist + precip + species + `C%` + `N%`,
                  data = trees, family = binomial)
step.model <- stepAIC(full.model, direction = "both", trace = FALSE)

set.seed(123)
training.samples <- trees$infected %>% createDataPartition(p = 0.8, list = FALSE)
train <- trees[training.samples, ]
test  <- trees[-training.samples, ]

# SMOTE the training data only (note: a single tilde in the formula;
# as.data.frame() preserves the non-syntactic C% / N% column names,
# which data.frame() would mangle)
newData <- SMOTE(infected ~ age + LAI + avg_temp + avg_moist + precip + species + `C%` + `N%`,
                 data = as.data.frame(train), perc.over = 100, perc.under = 200)

full.model.smote <- glm(infected ~ age + LAI + avg_temp + avg_moist + precip + species + `C%` + `N%`,
                        data = newData, family = binomial)
step.model.smote <- stepAIC(full.model.smote, direction = "both", trace = FALSE)

# Predicted probabilities; note this evaluates on the SMOTE-balanced training
# data (newData), not on the held-out test set
probabilities.smote <- step.model.smote %>% predict(newData, type = "response")
predicted.classes.smote <- ifelse(probabilities.smote > 0.5, "1", "0")

Best Answer

I would recommend some changes to your approach.

First, with only 10 effective predictors (species with 4 levels counts as 3) there should be no need for predictor selection provided that you have on the order of 100-200 infected trees in your data sample. The usual rule of thumb for logistic regression is about 15 of the minority class per predictor evaluated. If you don't have that many infected trees you will probably be much better off using a logistic ridge regression, which will keep all predictors but penalize their regression coefficients to minimize overfitting.
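A ridge fit is straightforward with the glmnet package. Here is a minimal sketch, assuming the column names from your data and that infected and species are factors as in your script; the penalty strength is chosen by cross-validation rather than hand-picked:

library(glmnet)

# Build the design matrix (dummy-codes species); drop the intercept column
x <- model.matrix(infected ~ age + LAI + avg_temp + avg_moist + precip +
                    species + `C%` + `N%`, data = trees)[, -1]
y <- trees$infected

# alpha = 0 gives the ridge penalty; cross-validation chooses lambda
set.seed(123)
cv.ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0)
coef(cv.ridge, s = "lambda.min")  # penalized coefficients, all predictors kept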

Second, predictor selection of this type is particularly risky (a) in general and (b) in logistic regression specifically: omitting any predictor associated with the outcome can bias the coefficient estimates of the retained predictors, often toward lower magnitudes.
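A quick simulation illustrates the point (a sketch with made-up data; in logistic regression this attenuation occurs even when the omitted predictor is independent of the retained one):

# Two independent predictors, both with true coefficient 1
set.seed(1)
n  <- 1e5
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(x1 + x2))
coef(glm(y ~ x1 + x2, family = binomial))["x1"]  # close to 1
coef(glm(y ~ x1,      family = binomial))["x1"]  # noticeably below 1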

Third, don't use accuracy as your criterion of model performance. In particular, don't use a probability cutoff of 0.5 to gauge accuracy, as that implicitly assumes that false-positive and false-negative classifications are equally costly. That's probably not the case for you, since you want to find those rare infected trees. The logistic regression model provides probabilities; when applying the model you can choose a probability cutoff suited to your intended purpose, and in your case you would probably want one much lower than 0.5.
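For example, reusing the object names from your code (the cutoff value below is purely illustrative; choose it from your actual misclassification costs):

# Predicted probabilities on the held-out test set
probs <- predict(step.model.smote, newdata = test, type = "response")

# A cutoff far below 0.5 trades more false positives for fewer missed infections
cutoff <- 0.05
pred <- ifelse(probs > cutoff, "1", "0")
table(predicted = pred, observed = test$infected)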

Fourth, please read this discussion about issues with unbalanced data like yours and this discussion about SMOTE. If you still decide that SMOTE is the way to go, and you decide to omit predictors despite my recommendation, SMOTE should certainly be done first, before you have removed any predictors. SMOTE generates synthetic data by a type of interpolation among minority-class cases, so you want to give the algorithm as much information as possible to start with.
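In code terms, that ordering looks something like this (a sketch using DMwR's SMOTE() as in your script, applied to the training data only and with every predictor still present):

# Oversample the minority class BEFORE any selection step, using all predictors
balanced <- SMOTE(infected ~ ., data = as.data.frame(train),
                  perc.over = 100, perc.under = 200)
table(balanced$infected)  # check the resulting class balance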

Finally, a logistic regression as you have written it might not work well if interactions among predictors are important. If you don't already know which interactions are likely to be important, you might consider another approach like boosted trees, which can be implemented to look for interactions while minimizing overfitting. If you go that route, however, make sure to use an approach that provides probability estimates rather than all-or-none classifications (see "Third" above).
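If you want to try that, here is a sketch with the xgboost package (one of several boosting implementations; the tuning values below are placeholders, not recommendations):

library(xgboost)

# The binary:logistic objective returns predicted probabilities, not hard classes
x.train <- model.matrix(infected ~ . - 1, data = train)
fit <- xgboost(data = x.train,
               label = as.numeric(train$infected) - 1,  # factor -> 0/1
               nrounds = 100, max_depth = 3, eta = 0.1,
               objective = "binary:logistic", verbose = 0)

x.test <- model.matrix(infected ~ . - 1, data = test)
prob.test <- predict(fit, x.test)  # probabilities, to use with your own cutoff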
