I want to use stepwise regression to reduce the number of variables. My dependent variable is a dummy variable (Fraud = 1, Non-fraud = 0) and I have 25 predictor variables. How can I do this?
Solved – How to do stepwise regression with a binary dependent variable
Tags: logistic, stepwise regression
Related Solutions
I would not recommend you use that procedure. My recommendation is: Abandon this project. Just give up and walk away. You have no hope of this working.
Setting aside the standard problems with stepwise selection (cf., here), in your case you are very likely to have perfect predictions due to separation in such a high-dimensional space.
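To see what separation does in practice, here is a tiny invented example (not the question's data): a single binary predictor that splits the two classes perfectly already breaks maximum-likelihood logistic regression.

```r
# Invented toy data: x perfectly separates y (y = 1 exactly when x = 1)
y <- c(0, 0, 0, 0, 1, 1, 1, 1)
x <- c(0, 0, 0, 0, 1, 1, 1, 1)

# glm() warns "fitted probabilities numerically 0 or 1 occurred";
# the MLE of the slope is +Inf, so the number printed is just where
# the optimizer happened to stop -- it is not a usable estimate.
fit <- glm(y ~ x, family = binomial)
coef(fit)["x"]   # a huge, meaningless slope
```

With thousands of candidate features and only tens of observations, many feature sets will separate the classes like this purely by chance.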
I don't have specifics on your situation, but you state that you have "only a few 10s of samples". Let's be charitable and say you have 90. You further say you have "several thousand features"; let's imagine that you 'only' have 2,000. For the sake of simplicity, let's say all your features are binary. You "believe that the class label can be accurately predicted using only a few features"; let's say you will look for sets of at most 9 features. Lastly, let's imagine that the relationship is deterministic, so that the true relationship will always be perfectly present in your data. (We can change these numbers and assumptions, but that should only make the problem worse.) Now, how well would you be able to recover that relationship under these (generous) conditions? That is, how often would the correct set be the only set that yields perfect accuracy? Or, put another way, how many sets of nine features would also fit by chance alone?
Some (overly) simple math and simulations should provide some clues to this question. First, with 9 binary variables, the number of distinct patterns an observation could show is $2^9 = 512$, but you will have only 90 observations. Thus it is entirely possible that, for a given set of 9 binary variables, every observation has a different set of predictor values, i.e., there are no replicates. Without replicates that share the same predictor values but differ in the response (some with y=0 and some with y=1), you will have complete separation, and perfect prediction of every observation will be possible.
Below, I have a simulation (coded in R) to see how often you might have no patterns of x-values shared between the 0s and the 1s. It works by drawing 90 numbers from 1 to 512, which represent the possible patterns, and checking whether any of the patterns in the first 45 (which might be the 0s) match any of the patterns in the second 45 (which might be the 1s). This assumes that you have perfectly balanced response data, which gives you the best possible protection against this problem. Note that having some replicated x-vectors with differing y-values doesn't really get you out of the woods; it just means you wouldn't be able to perfectly predict every single observation in your dataset, which is the very stringent standard I'm using here.
set.seed(7938)  # this makes the simulation exactly reproducible
my.fun = function(){
  x = sample.int(512, size=90, replace=TRUE)   # 90 observations, each showing 1 of 512 patterns
  return(sum(x[1:45] %in% x[46:90]) == 0)      # TRUE if no pattern is shared between the halves
}
n.unique = replicate(10000, my.fun())
mean(n.unique)  # [1] 0.0181
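As a sanity check on that simulated figure (my addition, not part of the original answer): there are $45 \times 45 = 2025$ cross-pairs between the two halves, and each pair matches with probability $1/512$, so, treating the pairs as roughly independent, the chance of no match at all is about $(1 - 1/512)^{2025} \approx 0.019$.

```r
# Back-of-the-envelope approximation to the simulation: each of the
# 45 * 45 = 2025 cross-pairs between the halves matches with
# probability 1/512, treated as independent
p.no.overlap <- (1 - 1/512)^(45 * 45)
round(p.no.overlap, 3)  # ~0.019, in line with the simulated 0.0181
```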
The simulation suggests you would have this issue with approximately 1.8% of the sets of 9 x-variables. Now, how many sets of 9 are there? Strictly, that would be $\binom{1991}{9} \approx 1.3\times 10^{24}$, drawing from the 1991 null variables (since we've stipulated that the 9 true deterministic causal variables are fixed and in your set). However, many of those sets will overlap; there are $\lfloor 1991 / 9 \rfloor = 221$ non-overlapping sets of 9 within a specified partition of your variables (with many such partitions possible). Thus, within some given partition, we might expect there would be $221\times 0.018\approx 4$ sets of 9 x-variables that will perfectly predict every observation in your dataset.
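The counts in the last paragraph are easy to verify directly (just arithmetic, no new assumptions):

```r
choose(1991, 9)  # ~1.3e24 possible sets of nine null variables
1991 %/% 9       # 221 disjoint sets of nine within one partition
221 * 0.0181     # ~4 expected perfectly-separating null sets per partition
```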
Note that these results apply only when you have a relatively large dataset (toward the top of the "tens"), a relatively small number of variables (toward the bottom of the "thousands"), and when we count only cases where every single observation can be predicted perfectly (there will be many more sets that are nearly perfect), etc. Your actual case is unlikely to work out 'this well'. Moreover, we stipulated that the relationship is perfectly deterministic. What would happen if there is some random noise in the relationship? In that case, you will still have ~4 (null) sets that perfectly predict your data, but the right set may well not be among them.
Tl;dr, the basic point here is that your set of variables is way too large / high dimensional, and your amount of data is way too small, for anything to be possible. If it's really true that you have "tens" of samples, "thousands" of variables, and absolutely no earthly idea which variables might be right, you have no hope of getting anywhere with any procedure. Go do something else with your time.
I think you can set up your base model, that is, the one with your 12 IVs, and then use add1() with the remaining predictors. So, say you have a model mod1 defined like mod1 <- lm(y ~ 0+x1+x2+x3) (0+ means no intercept); then

add1(mod1, ~ .+x4+x5+x6, test="F")

will add and test one predictor after the other on top of the base model.
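A self-contained version of that call, with simulated data (the variable names x1 through x6 mirror the snippet above and are otherwise arbitrary):

```r
set.seed(42)
d <- as.data.frame(replicate(6, rnorm(50)))
names(d) <- paste0("x", 1:6)
d$y <- d$x1 + 0.5 * d$x2 + rnorm(50)

# Base model with the predictors we insist on keeping (0+ = no intercept)
mod1 <- lm(y ~ 0 + x1 + x2 + x3, data = d)

# Try x4, x5, x6 one at a time on top of mod1, each with an F-test
add1(mod1, scope = ~ . + x4 + x5 + x6, test = "F")
```

The output is an ANOVA-style table with one row per candidate predictor, so you can see each variable's marginal contribution over the fixed base model.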
More generally, if you know in advance that a set of variables should be included in the model (this might result from prior knowledge, or whatever), you can use step() or stepAIC() (in the MASS package) and look at the scope= argument.
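Since the original question has a binary outcome, it's worth noting that the same scope= mechanism works for logistic regression: stepAIC() accepts glm() fits directly. A sketch on simulated data (all names invented), shown purely for the mechanics; the caveats about stepwise selection elsewhere on this page still apply:

```r
library(MASS)  # for stepAIC()

set.seed(1)
dat <- as.data.frame(replicate(5, rnorm(200)))
names(dat) <- paste0("x", 1:5)
dat$y <- rbinom(200, 1, plogis(1.5 * dat$x1 - dat$x2))

# Logistic base model; the lower scope forces x1 to stay in
fit0 <- glm(y ~ x1, family = binomial, data = dat)
fit.step <- stepAIC(fit0,
                    scope = list(upper = ~ x1 + x2 + x3 + x4 + x5,
                                 lower = ~ x1),
                    trace = FALSE)
formula(fit.step)  # x2 should be picked up given its strong true effect
```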
Here is an illustration, where we specify a priori the functional relationship between the outcome, $y$, and the predictors, $x_1, x_2, \dots, x_{10}$. We want the model to include the first three predictors, but let the selection of other predictors be done by stepwise regression:
set.seed(101)
X <- replicate(10, rnorm(100))
colnames(X) <- paste("x", 1:10, sep="")
y <- 1.1*X[,1] + 0.8*X[,2] - 0.7*X[,5] + 1.4*X[,6] + rnorm(100)
df <- data.frame(y=y, X)
# say this is one of the base models we think of
fm0 <- lm(y ~ 0+x1+x2+x3+x4, data=df)
# build a semi-constrained stepwise regression
fm.step <- step(fm0, scope=list(upper = ~ 0+x1+x2+x3+x4+x5+x6+x7+x8+x9+x10,
lower = ~ 0+x1+x2+x3), trace=FALSE)
summary(fm.step)
The results are shown below:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x1 1.0831 0.1095 9.888 2.87e-16 ***
x2 0.6704 0.1026 6.533 3.17e-09 ***
x3 -0.1844 0.1183 -1.558 0.123
x6 1.6024 0.1142 14.035 < 2e-16 ***
x5 -0.6528 0.1029 -6.342 7.63e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.004 on 95 degrees of freedom
Multiple R-squared: 0.814, Adjusted R-squared: 0.8042
F-statistic: 83.17 on 5 and 95 DF, p-value: < 2.2e-16
You can see that $x_3$ has been retained in the model, even though it proves to be non-significant (the usual caveats about univariate tests in a multiple-regression setting and about model selection apply here; at least, its relationship with $y$ was not specified in the data-generating model).
Best Answer
Do not use step-wise regression.
Why not? Because step-wise regression will almost certainly yield biased results. All statistics produced through step-wise model building have a nested chain of invisible/unstated "conditional on excluding X" and/or "conditional on including X" statements built into them, with the result that reported $p$-values are too small, confidence intervals are too narrow, and coefficient estimates are biased away from zero.
What to use instead of step-wise regression
Use substantive theory to guide which predictor variables to include in your model, and report non-significant findings. If needed, you can table only the significant results in the main text of an article or report and include the full model output in an appendix. Step-wise regression, by contrast, is a reliable way to get consistently unreliable model results.
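In R, that advice amounts to fitting the one model you specified in advance and reporting it whole. A minimal sketch with simulated stand-in data (fraud, amount, age, and n_prior are invented placeholder names, not from the question):

```r
set.seed(3)
# Invented stand-in for a real fraud dataset
claims <- data.frame(amount  = rlnorm(300, meanlog = 7),
                     age     = rnorm(300, 45, 12),
                     n_prior = rpois(300, 1))
claims$fraud <- rbinom(300, 1, plogis(-8 + log(claims$amount)))

# One model, chosen on substantive grounds before seeing the results,
# reported in full -- non-significant terms included
full.fit <- glm(fraud ~ log(amount) + age + n_prior,
                family = binomial, data = claims)
summary(full.fit)
```

Because the model is fixed before fitting, the reported standard errors and $p$-values retain their usual interpretation, which is exactly what stepwise selection destroys.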
Some references on the topic
Babyak, M. A. (2004). What you see may not be what you get: A brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic Medicine, 66:411–421.
Flom, P. L. and Cassell, D. L. (2007). Stopping stepwise: Why stepwise and similar selection methods are bad, and what you should use.
Henderson, D. A. and Denison, D. R. (1989). Stepwise regression in social and psychological research. Psychological Reports, 64:251–257.
Huberty, C. J. (1989). Problems with stepwise methods—better alternatives. Advances in Social Science Methodology, 1:43–70.
Hurvich, C. M. and Tsai, C.-L. (1990). The impact of model selection on inference in linear regression. The American Statistician, 44(3):214–217.
Malek, M. H., Berger, D. E., and Coburn, J. W. (2007). On the inappropriateness of stepwise regression analysis for model building and testing. European Journal of Applied Physiology, 101(2):263–264.
McIntyre, S. H., Montgomery, D. B., Srinivasan, V., and Weitz, B. A. (1983). Evaluating the statistical significance of models developed by stepwise regression. Journal of Marketing Research, 20(1):1–11.
Pope, P. T. and Webster, J. T. (1972). The use of an $F$-statistic in stepwise regression procedures. Technometrics, 14(2):327–340.
Rencher, A. C. and Pun, F. C. (1980). Inflation of $R^{2}$ in best subset regression. Technometrics, 22(1):49–53.
Romano, J. P. and Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73(4):1237–1282.
Sribney, B., Harrell, F., and Conroy, R. (2011). Problems with stepwise regression.
Steyerberg, E. W., Eijkemans, M. J., and Habbema, J. D. F. (1999). Stepwise selection in small data sets: A simulation study of bias in logistic regression analysis. Journal of Clinical Epidemiology, 52(10):935–942.
Thompson, B. (1995). Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial. Educational and Psychological Measurement, 55(4):525–534.
Whittingham, M., Stephens, P., Bradbury, R., and Freckleton, R. (2006). Why do we still use stepwise modelling in ecology and behaviour? Journal of Animal Ecology, 75(5):1182–1189.
Wilkinson, L. (1979). Tests of significance in stepwise regression. Psychological Bulletin, 86(1):168–174.