Solved – Understanding Bagged Logistic Regression (and a Python Implementation)

Tags: logistic, python, regression

I am trying to understand an academic white paper I am reading on bagged logistic regression for marketing attribution — http://www.turn.com.akadns.net/sites/default/files/whitepapers/TURN_Tech_WP_Data-driven_Multi-touch_Attribution_Models.pdf

Particularly, this paragraph:

Step 1. For a given data set, sample a proportion (ps) of all the sample observations and a proportion (pc) of all the covariates. Fit a logistic regression model on the sampled covariates and the sampled data. Record the estimated coefficients — we recommend to choose ps and pc to take values around 0.5 if both the variability and the accuracy are of the concern.

Can someone please explain what this means in (hopefully) plain English? Based on my understanding, the idea is to just keep running the logistic regression on random subsets covering about half of the sample data and then average all of the log-odds coefficients that meet a 0.5 selection threshold?

Completely optional bonus points 1: On a side note, is this implementation similar to the idea of randomized logistic regression in scikit-learn for Python? If not, what is the difference?
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RandomizedLogisticRegression.html

Completely optional bonus points 2: Is there a way to incorporate ordered effects into a bagged logistic regression model (e.g. the order in which the predictor variables, in this case advertisements, appeared)? This is of secondary concern to the primary question.

Best Answer

Bagging is an ensemble method where you train models on separate samples of the training data and combine their predictions (by averaging, voting, ...). This generally produces more accurate predictions than the individual models. Technically, bagging means that the samples are drawn with replacement and are of the same size as the full data set. However, the term is sometimes also applied to other sampling schemes.
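To make the general idea concrete, here is a minimal sketch of bagging in the strict sense: each model is fit on a bootstrap sample (drawn with replacement, same size as the data) and the predictions are combined by majority vote. The "decision stump" base model and the toy data are my own invention for illustration, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: class 1 tends to have larger x values.
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])

def fit_stump(x, y):
    """Pick the threshold that best separates the two classes."""
    best_t, best_acc = None, -1.0
    for t in np.unique(x):
        acc = np.mean((x > t) == y)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Bagging proper: each sample is drawn WITH replacement
# and has the same size as the full data set.
n = len(x)
thresholds = []
for _ in range(25):
    idx = rng.integers(0, n, size=n)      # bootstrap sample
    thresholds.append(fit_stump(x[idx], y[idx]))

def predict(x_new):
    # Combine the individual stumps by majority vote.
    votes = np.array([x_new > t for t in thresholds])
    return (votes.mean(axis=0) > 0.5).astype(int)

print(predict(np.array([-1.0, 3.0])))  # prints [0 1]
```

Any base model works in place of the stump; logistic regression is just one choice.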

Bagged logistic regression means bagging with logistic regression as the individual model, but it is bagging in the loose sense of the word. They are really combining subsampling (i.e. sampling rows without replacement) with random subspaces (sampling the columns/features).

In the quote, ps is the fraction of the rows/items included in each sample and pc is the fraction of the columns/features. They just use more statistics-flavored terminology, where observations are the rows and covariates are the columns.
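The quoted Step 1 (plus the averaging of recorded coefficients) can be sketched roughly as follows. This is my own minimal reading of the procedure, with a hand-rolled gradient-descent logistic fit and synthetic data standing in for the paper's actual implementation; the final coefficient for each covariate is averaged over only the models in which that covariate was sampled:

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_logreg(X, y, lr=1.0, n_iter=500):
    """Plain logistic regression via gradient descent
    (no intercept, kept minimal for illustration)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Toy data: 2 informative covariates out of 4.
n, d = 500, 4
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -2.0, 0.0, 0.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

ps, pc, n_models = 0.5, 0.5, 50   # the paper's recommended ~0.5
coef_sum = np.zeros(d)
coef_cnt = np.zeros(d)

for _ in range(n_models):
    rows = rng.choice(n, size=int(ps * n), replace=False)  # ps of the observations
    cols = rng.choice(d, size=int(pc * d), replace=False)  # pc of the covariates
    w = fit_logreg(X[np.ix_(rows, cols)], y[rows])
    coef_sum[cols] += w
    coef_cnt[cols] += 1

# Average each coefficient over the models where its covariate appeared.
coef = coef_sum / coef_cnt
print(coef.round(2))
```

The informative covariates end up with clearly nonzero averaged coefficients of the correct sign, while the noise covariates average out near zero.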

This is close to what sklearn.linear_model.RandomizedLogisticRegression does internally. The main differences are that RandomizedLogisticRegression does not support column sampling, and it is not a predictive model: it is used only to select relevant features.

Bagging does not really offer anything extra for dealing with sequencing information. You can create features that encode the sequencing information, as you would with any other machine learning method, but if that is the main thing you are interested in, you should look into specialized methods.
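As one hypothetical way to encode order before feeding the data to any classifier, you could turn each user's ordered ad path into per-channel exposure counts plus the position of the first exposure. The channel names and paths below are made up for illustration:

```python
# Each "path" is the ordered list of ads a user saw, with a conversion label.
paths = [
    (["search", "display", "email"], 1),
    (["display", "display"], 0),
    (["email", "search"], 1),
]

channels = ["search", "display", "email"]

def order_features(path):
    """Encode one path as a flat feature vector:
    per-channel exposure counts, then the 1-based position of the
    first exposure (0 when the channel never appeared)."""
    counts = [path.count(c) for c in channels]
    first_pos = [path.index(c) + 1 if c in path else 0 for c in channels]
    return counts + first_pos

X = [order_features(p) for p, _ in paths]
y = [label for _, label in paths]
print(X[0])  # prints [1, 1, 1, 1, 2, 3]
```

Any row-based model, bagged or not, can then consume X; the ordering information survives in the first-position features.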
