Solved – Logistic regression with grouped data

logisticpython

Likely a simple question, have searched for an answer I promise!

I'm looking to do a Logistic regression for a dataset in which data is grouped by an ID, where there is one positive flag per group and the groups vary in size. A simple example would be a prediction of who is the heaviest in a group of individuals, given their age and height:

| GroupId | Age | Height | Heaviest |
|---------|-----|--------|----------|
| 1       | 27  | 198    | 1        |
| 1       | 42  | 165    | 0        |
| 1       | 34  | 133    | 0        |
| 2       | 63  | 176    | 1        |
| 2       | 27  | 189    | 0        |
| 2       | 55  | 165    | 0        |
| 2       | 44  | 166    | 0        |

My question is how we can leverage the grouping information into a logistic regression, as many of the positive flags will be on individuals who shouldn't be characterised as 'Heaviest' in a global sense, or am I thinking about it the wrong way? For what it's worth, I'm working with python's statsmodels library.

Best Answer

So after searching through leads kindly provided by this thread I've concluded that a Cox proportional hazards model would probably be the most appropriate, as this allows for stratification of the data by an ID as above.

For the curious, I came across lifelines for python which has a good implementation and have been doing some moderately successful tests with it.

Thanks all!

Related Solutions

Solved – Logistic regression: grouped and ungrouped variables (using R)

Table 3.1 is reproduced below:

enter image description here

Agresti considered the following numerical scores for snoring level: {0,2,4,5}.

There are two ways to fit a GLM with R: either your outcome is provided as a vector of 0/1 or a factor with two levels, with the predictors on the rhs of your formula; or you can give a matrix with two columns of counts for success/failure as the lhs of the formula. The latter corresponds to what Agresti call 'grouped' data. A third method, which also applies to grouped settings, would be to use the weights= argument to indicate how many positive and negative outcomes were observed for each category of the classification table.

Data in matrix view would read:

snoring <- matrix(c(24,35,21,30,1355,603,192,224), nc=2)

From this, we can generate a data.frame in long format (2484 rows = sum(snoring) observations) as follows:

snoring.df <- data.frame(snoring=gl(4, 1, labels=c("Never", "Occasional",
                                                   "Nearly every night", 
                                                   "Every night")),
                         disease=gl(2, 4, labels=c("Yes", "No")),
                         counts=as.vector(snoring))
snoring.df <- snoring.df[rep(seq_len(nrow(snoring.df)), snoring.df$counts), 1:2]

And the following two models will yield identical results:

levels(snoring.df$snoring) <- c(0, 2, 4, 5)
y <- abs(as.numeric(snoring.df$disease)-2)
x <- as.numeric(as.character(snoring.df$snoring))
fit.glm1 <- glm(y ~ x, family=binomial)

fit.glm2 <- glm(snoring ~ c(0, 2, 4, 5), family=binomial)

That is, $\text{logit}[\hat\pi(x)]=-3.87+0.40x$, using Agresti's notation.

The second notation is frequently used on aggregated table with an instruction like cbind(a, b), where a and b are columns of counts for a binary event (see e.g., Generalized Linear Models). It looks like it would also work when using table instead of matrix (as in your example), e.g.

glm(as.table(snoring) ~ c(0, 2, 4, 5), family=binomial)

Solved – Logistic regression with binomial data in Python

The statsmodel package has glm() function that can be used for such problems. See an example below:

import statsmodels.api as sm

glm_binom = sm.GLM(data.endog, data.exog, family=sm.families.Binomial())

More details can be found on the following link. Please note that the binomial family models accept a 2d array with two columns. Each observation is expected to be [success, failure]. In the above example that I took from the link provided below, data.endog corresponds to a two dimensional array (Success: NABOVE, Failure: NBELOW).

Relevant documentation: https://www.statsmodels.org/stable/examples/notebooks/generated/glm.html

Best Answer

Related Solutions

Solved – Logistic regression: grouped and ungrouped variables (using R)

Solved – Logistic regression with binomial data in Python

Related Question