Binary Logistic Regression – Addressing Abnormally High Standard Error

Tags: binary data, logistic, regression, separation, standard error

I ran a binary logistic regression in SPSS with a sample size of 24. The study is about the association between teenage pregnancy and depressive symptoms. Of the 24 participants, 11 are pregnant teenagers, and only 1 of the 24 is positive for the presence of depressive symptoms. Both variables are coded as nominal, so that 1 means yes and 0 means no.

What could the abnormally high standard error mean? And is there any way to solve it?

Hoping for kind responses. Thank you.

Best Answer

You have almost-separation (usually called quasi-complete separation), as Scortchi notes.

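Your data (11 pregnant teenagers, none of whom are depressed, and 13 non-pregnant ones, one of whom is depressed) are consistent with a model that essentially says that if you are pregnant, you have a zero chance of depression, whereas if you are not pregnant, you have a small chance. Logistic regression does not play well with chances of zero or one: a probability of zero corresponds to a log-odds of minus infinity, so the maximum likelihood estimate of the pregnancy coefficient is not finite, and the fitting algorithm simply stops at some large negative value where the likelihood is almost flat. That flatness shows up as a very large standard error, which is one symptom of separation.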
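The symptom is easy to reproduce. The following is a minimal sketch in R (the question used SPSS, so the software and the variable names `pregnant` and `depressed` are assumptions); the 24 rows are rebuilt from the counts given in the question.

```r
# Rebuild the 24 observations from the counts in the question:
# 11 pregnant (none depressed), 13 not pregnant (one depressed).
pregnant  <- c(rep(1, 11), rep(0, 13))
depressed <- c(rep(0, 11), rep(0, 12), 1)

# glm() may warn that fitted probabilities of 0 or 1 occurred. That warning is
# itself a symptom of separation, so it is silenced here.
fit <- suppressWarnings(glm(depressed ~ pregnant, family = binomial))

# Coefficient table: estimate, standard error, z value, p value
est <- coef(summary(fit))
print(est)
```

The intercept is simply the log-odds of depression among the 13 non-pregnant teenagers, log(1/12). The `pregnant` coefficient is a large negative number with a very large standard error, which is the pattern described in the question.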

And of course, it's not as if there really were a zero chance of depression during pregnancy. You will simply need to collect more data.

With a point prevalence of about 3% for MDD, 24 participants are way too few, in any case, unless you screened explicitly for depression. This should have been taken into account during sample size calculation.
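To put a number on that, here is a rough sketch in R that treats the 24 participants as independent draws, each with a 3% chance of depression (the binomial model and the variable names are assumptions made for illustration).

```r
n    <- 24     # sample size from the question
prev <- 0.03   # point prevalence quoted above

expected_cases <- n * prev            # expected number of depressed participants
p_none         <- dbinom(0, n, prev)  # probability of observing no case at all
p_at_most_one  <- pbinom(1, n, prev)  # probability of observing zero or one case

print(c(expected = expected_cases, none = p_none, at_most_one = p_at_most_one))
```

Under these assumptions fewer than one depressed participant is expected, so ending up with a single case, as in the question, is the typical outcome rather than bad luck.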

Summary

You appear to be looking at the associations between symptoms (a, b, c, d, and e, coded as linear, numeric variables) and cancer status (yes versus no, coded in binary).

Associations versus predictions

I think you are looking at associations between the symptoms and cancer status rather than the ability of the symptoms to predict cancer status. If you really wanted to investigate predictive ability, the simplest approach would be to divide your data set in half, fit models to one half of the data, and then use them to predict the cancer status of the patients in the other half. That is the simplest case of validating a model with a single data set, but you shouldn't actually do it, because half of your data is then lost for fitting. What you could do instead is employ k-fold cross-validation (for example, using the rms package in R) to make the most efficient use of your data.
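As a rough illustration of the idea, the sketch below runs 5-fold cross-validation by hand in base R rather than with the rms package. The data are simulated stand-ins (one predictor `x`, a binary outcome `y`), and the Brier score is just one possible performance measure; all of these choices are assumptions.

```r
set.seed(1)

# Simulated stand-in data (hypothetical): one predictor x and a binary outcome y
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))
d <- data.frame(x = x, y = y)

# Assign every row to one of k folds at random
k    <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, predict the held-out fold
pred <- numeric(n)
for (i in seq_len(k)) {
  fit <- glm(y ~ x, family = binomial, data = d[fold != i, ])
  pred[fold == i] <- predict(fit, newdata = d[fold == i, ], type = "response")
}

# Out-of-sample Brier score (mean squared error of the predicted probabilities)
brier <- mean((pred - d$y)^2)
print(brier)
```

Every observation is used for fitting in k - 1 of the folds and for testing exactly once, which is why this wastes less data than a single half-and-half split.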

Starting off

You may have already done this, but prior to playing around with logistic regression modeling I think you should take a step back and just look at your data. Using the program R to compute a few basic summary statistics...

# Load libraries
library(Rmisc)
library(metafor)

# Load data
data <- read.csv("example_data.csv", header = TRUE, na.strings = "")
attach(data)

# Summarize data
summary(data)
       a              b               c               d               e             cancer      
 Min.   :11.0   Min.   :13.00   Min.   :13.00   Min.   :12.00   Min.   :17.00   Min.   :0.0000  
 1st Qu.:19.0   1st Qu.:27.00   1st Qu.:28.00   1st Qu.:36.00   1st Qu.:33.00   1st Qu.:1.0000  
 Median :24.0   Median :31.00   Median :32.00   Median :40.00   Median :38.00   Median :1.0000  
 Mean   :24.8   Mean   :31.39   Mean   :32.44   Mean   :39.39   Mean   :37.71   Mean   :0.9169  
 3rd Qu.:30.0   3rd Qu.:36.00   3rd Qu.:37.00   3rd Qu.:43.50   3rd Qu.:42.00   3rd Qu.:1.0000  
 Max.   :49.0   Max.   :50.00   Max.   :50.00   Max.   :50.00   Max.   :50.00   Max.   :1.0000  
 NA's   :20     NA's   :18      NA's   :21      NA's   :20      NA's   :20      NA's   :6

And now to plot some exploratory scatter plots... Pay attention to any linear relationships between variables that pop out to your eye. Also pay attention (as Benjamin mentioned below) to the plots of the symptom variables versus cancer status.

plot(data)

[Figure: Scatter plots]

And look at some histograms to get a sense of the distribution of your data... It is always good to do this before plugging the variables into a regression model.

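# Base R's hist() does not accept a whole data frame, so draw one histogram per column
for (v in names(data)) hist(data[[v]], main = v, xlab = v)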

[Figure: Histograms]

Going a bit further...

I would compute the mean and 95% CI for each symptom variable, stratified by cancer status, and plot those... Just by looking at this you will get a good visual sense of which variables are likely to show an association in your logistic regression model. Here I just plot the data...

forest(
  x     = c(24.44636, 28.94667, 31.63066, 28.62963, 32.59910, 30.65852, 39.79738, 35.04111, 37.99030, 34.41185),
  ci.lb = c(23.57979, 25.72939, 30.84611, 26.15883, 31.88579, 28.52778, 39.16493, 32.27390, 37.26171, 32.10734),
  ci.ub = c(25.31292, 32.16395, 32.41520, 31.10043, 33.31242, 32.78926, 40.42983, 37.80832, 38.71888, 36.71637),
  xlab  = "Mean and 95% CI",
  slab  = c("a cancer", "a healthy", "b cancer", "b healthy", "c cancer", "c healthy", "d cancer", "d healthy", "e cancer", "e healthy")
)

[Figure: Forest plot]

Looking at the plot above, notice how much wider the intervals are for the healthy groups than for the cancer groups. That gives you a visual sense of the fact that you have way more cancer patients contributing to the data set than non-cancer patients.
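The numbers passed to forest() are hard-coded in the answer. One way such stratified means and intervals could be computed is sketched below in base R. The helper `mean_ci()` and the simulated data are hypothetical, and the t-based interval is an assumption about how the original values were obtained.

```r
# Mean and t-based 95% confidence interval, ignoring missing values
mean_ci <- function(x, level = 0.95) {
  x  <- x[!is.na(x)]
  n  <- length(x)
  m  <- mean(x)
  hw <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)  # half-width
  c(mean = m, lower = m - hw, upper = m + hw)
}

# Simulated stand-in for one symptom score and cancer status (hypothetical data)
set.seed(1)
cancer <- rbinom(300, 1, 0.9)
a      <- round(rnorm(300, mean = ifelse(cancer == 1, 24, 29), sd = 7))

# One row per group (0 = healthy, 1 = cancer): mean, lower and upper limit
by_status <- t(sapply(split(a, cancer), mean_ci))
rownames(by_status) <- c("a healthy", "a cancer")
print(by_status)
```

Because the half-width shrinks with the square root of the group size, the much smaller healthy group ends up with the wider interval, as in the plot.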

Last...

I would just compute univariate effect estimates for the association of each symptom variable with cancer status. Then I would multiply all of the resulting p values by five (a Bonferroni correction, capping the result at 1), since you are doing that many exploratory tests. You can do that in SPSS easily. For the results of the models, I would focus more on the direction, magnitude, and confidence intervals of the resulting effect estimates. Below I have plotted the effect estimates and their confidence intervals from univariate models of each separate symptom variable... Now you should go build models that are adjusted for age, gender, smoking, etc. and make another plot like this... I do agree with Benjamin that there is probably not a whole lot you can learn from these data given the paucity of healthy controls.

[Figure: Logistic regression results]
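The univariate models and the multiply-by-five correction could be sketched in R roughly as follows. The data are simulated stand-ins with the same column names as in the summary above, and the Wald intervals from `confint.default()` are an assumption; the original plot may have used a different interval.

```r
set.seed(1)

# Simulated stand-in data (hypothetical): five symptom scores and cancer status
n   <- 300
dat <- data.frame(a = rnorm(n, 25, 7), b = rnorm(n, 31, 7), c = rnorm(n, 32, 7),
                  d = rnorm(n, 39, 6), e = rnorm(n, 38, 6))
dat$cancer <- rbinom(n, 1, plogis(4 - 0.08 * dat$a))

symptoms <- c("a", "b", "c", "d", "e")

# Fit one logistic model per symptom; keep the odds ratio, Wald CI and p value
results <- do.call(rbind, lapply(symptoms, function(v) {
  fit <- glm(reformulate(v, response = "cancer"), family = binomial, data = dat)
  ci  <- confint.default(fit)[v, ]   # Wald interval on the log-odds scale
  data.frame(symptom = v,
             or      = exp(coef(fit)[[v]]),
             lower   = exp(ci[[1]]),
             upper   = exp(ci[[2]]),
             p_raw   = coef(summary(fit))[v, "Pr(>|z|)"])
}))

# Bonferroni: multiply each p value by the number of tests, capped at 1
results$p_bonf <- p.adjust(results$p_raw, method = "bonferroni")
print(results)
```

The `or`, `lower` and `upper` columns are what would go into a forest plot of effect estimates; the adjusted p values are secondary to them, as argued above.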

Logistic Regression – How to Choose Between Multiple and Univariate Logistic Regression

It is obviously not correct to rerun models until by chance they give what you expect...

Ideally, the model structure (i.e. selection of predictors, transformations, interactions) is chosen based on a number of points before computing the model:

Expert knowledge such as publications and research questions. This includes thinking.

Example: if you are mainly interested in the effect of a particular variable $X$ on the response $Y$ and you know that $X$ has a strong causal effect on $Z$, then it would be quite stupid to include $Z$ in the model along with $X$, because $Z$ would absorb the part of the effect of $X$ on $Y$ that runs through $Z$ and so partially hide it. That may be the case in your model.
Univariate distributions of variables (e.g. excluding potential predictor "sex" if there is only one male or if most values are missing; log-transform some right-skewed variables with outliers if it makes scientific sense etc.)
Bivariate distributions of the predictors (e.g. if both potential predictors "age" and "experience" are highly correlated, it might suffice to include just one of the two)
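The example with $X$ and $Z$ can be made concrete with a small simulation in R. Everything here is hypothetical: the effect sizes are invented, and the causal chain is deliberately extreme in that $X$ influences $Y$ only through $Z$, so the effect of $X$ is hidden completely rather than partially.

```r
set.seed(1)
n <- 5000

# Hypothetical causal chain: X -> Z -> Y, with no direct path from X to Y
x <- rnorm(n)
z <- x + rnorm(n, sd = 0.5)
y <- rbinom(n, 1, plogis(z))

fit_total    <- glm(y ~ x,     family = binomial)  # X alone
fit_adjusted <- glm(y ~ x + z, family = binomial)  # X together with Z

coef_x_total    <- coef(fit_total)[["x"]]
coef_x_adjusted <- coef(fit_adjusted)[["x"]]
print(c(total = coef_x_total, adjusted = coef_x_adjusted))

# The two predictors are also highly correlated (the bivariate check in the list)
print(cor(x, z))
```

With $X$ alone the coefficient is clearly positive; once $Z$ is added, the coefficient of $X$ drops to roughly zero even though $X$ is the root cause. Note that the check on `cor(x, z)` uses the predictors only and never looks at the response.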

The hidden message of the above points: Don't take into consideration the association between the response variable and the potential predictors at this point. Doing so will tend to bias the model to fit your expectations. It also answers Question 1: Such univariate screening is not suitable for variable selection. It might still be part of the analysis, though, as a complement to the multivariate model. This depends on the research question.

The answer to Question 2 depends on what the research question or the objective of the analysis is:

You could, for instance, be interested in testing some specific hypotheses. Then the "borderline significance" and multiple testing problem becomes an issue.
(And) or you might want to have a good predictive model. Then cross-validation (or similar) of the performance of the model is of much higher relevance than p values.
(And) or you might be interested in estimating the effects of some particular predictor.