Solved – Logistic regression: grouped and ungrouped variables (using R)

Tags: generalized-linear-model, logistic, r

I'm reading A. Agresti (2007), An Introduction to Categorical Data Analysis, 2nd edition, and am not sure whether I understand this paragraph (p. 106, section 4.2.1) correctly (although it should be easy):

In Table 3.1 on snoring and heart disease in the previous chapter, 254
subjects reported snoring every night, of whom 30 had heart disease.
If the data file has grouped binary data, a line in the data file
reports these data as 30 cases of heart disease out of a sample size
of 254. If the data file has ungrouped binary data, each line in the
data file refers to a separate subject, so 30 lines contain a 1 for
heart disease and 224 lines contain a 0 for heart disease. The ML
estimates and SE values are the same for either type of data file.

Transforming a set of ungrouped data (1 dependent, 1 independent variable) into grouped form would take more than "a line" to include all the information, wouldn't it?

In the following example an (unrealistically!) simple data set is created and a logistic regression model is built.

What would the grouped data actually look like (the variable tab?)? How could the same model be built using grouped data?

> dat = data.frame(y=c(0,1,0,1,0), x=c(1,1,0,0,0))
> dat
  y x
1 0 1
2 1 1
3 0 0
4 1 0
5 0 0
> tab=table(dat)
> tab
   x
y   0 1
  0 2 1
  1 1 1
> mod1=glm(y~x, data=dat, family=binomial())

Best Answer

Table 3.1 is reproduced below:

    Snoring               Heart disease: Yes     No
    Never                                 24   1355
    Occasional                            35    603
    Nearly every night                    21    192
    Every night                           30    224

Agresti considered the following numerical scores for snoring level: {0,2,4,5}.

There are two ways to fit a GLM in R: either the outcome is supplied as a vector of 0/1 values (or a factor with two levels), with the predictors on the right-hand side of the formula; or the left-hand side of the formula is a two-column matrix holding the counts of successes and failures. The latter corresponds to what Agresti calls 'grouped' data. A third method, which also applies to the grouped setting, is to use the weights= argument to indicate how many positive and negative outcomes were observed in each cell of the classification table.
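As a rough sketch of that third, weights-based variant (the data frame `cell` below is my own construction from the counts in Table 3.1, using Agresti's snoring scores):

```r
# One row per cell of the 4x2 classification table; counts enter as weights.
cell <- data.frame(
  x = rep(c(0, 2, 4, 5), 2),                   # Agresti's snoring scores
  y = rep(c(1, 0), each = 4),                  # 1 = heart disease, 0 = none
  n = c(24, 35, 21, 30, 1355, 603, 192, 224)   # cell counts from Table 3.1
)
fit.w <- glm(y ~ x, weights = n, family = binomial, data = cell)
coef(fit.w)  # should match the ungrouped fit (about -3.87 and 0.40)
```

With a 0/1 response and weights=n, each row is treated as n Bernoulli trials that were all successes or all failures, so the likelihood is the same as for the ungrouped data.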

Data in matrix view would read:

snoring <- matrix(c(24,35,21,30,1355,603,192,224), ncol=2)  # col 1 = yes, col 2 = no

From this, we can generate a data.frame in long format (2484 rows = sum(snoring) observations) as follows:

snoring.df <- data.frame(snoring=gl(4, 1, labels=c("Never", "Occasional",
                                                   "Nearly every night", 
                                                   "Every night")),
                         disease=gl(2, 4, labels=c("Yes", "No")),
                         counts=as.vector(snoring))
snoring.df <- snoring.df[rep(seq_len(nrow(snoring.df)), snoring.df$counts), 1:2]
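As a quick sanity check (assuming the expansion above ran as shown), the long-format data frame should have 2484 rows and reproduce the original cell counts:

```r
nrow(snoring.df)                          # 2484
xtabs(~ snoring + disease, data = snoring.df)  # recovers the 4x2 table of counts
```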

And the following two models will yield identical results:

levels(snoring.df$snoring) <- c(0, 2, 4, 5)
y <- abs(as.numeric(snoring.df$disease)-2)
x <- as.numeric(as.character(snoring.df$snoring))
fit.glm1 <- glm(y ~ x, family=binomial)

fit.glm2 <- glm(snoring ~ c(0, 2, 4, 5), family=binomial)

That is, $\text{logit}[\hat\pi(x)]=-3.87+0.40x$, using Agresti's notation.

The second notation is frequently used with aggregated tables, via an instruction like cbind(a, b), where a and b are columns of counts for a binary event (see e.g., Generalized Linear Models). It looks like it would also work when using a table instead of a matrix (as in your example), e.g.

glm(as.table(snoring) ~ c(0, 2, 4, 5), family=binomial)
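For completeness, a minimal sketch of the cbind(a, b) form for the same data (the names yes and no are just illustrative):

```r
yes <- c(24, 35, 21, 30)        # heart-disease counts per snoring level
no  <- c(1355, 603, 192, 224)   # disease-free counts per snoring level
fit.glm3 <- glm(cbind(yes, no) ~ c(0, 2, 4, 5), family = binomial)
coef(fit.glm3)  # same estimates as the fits above
```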