Solved – Logistic regression: grouped and ungrouped variables (using R)

Tags: generalized-linear-model, logistic, r

I'm reading A. Agresti (2007), An Introduction to Categorical Data Analysis, 2nd edition, and am not sure whether I understand this paragraph (p. 106, section 4.2.1) correctly (although it should be easy):

In Table 3.1 on snoring and heart disease in the previous chapter, 254
subjects reported snoring every night, of whom 30 had heart disease.
If the data file has grouped binary data, a line in the data file
reports these data as 30 cases of heart disease out of a sample size
of 254. If the data file has ungrouped binary data, each line in the
data file refers to a separate subject, so 30 lines contain a 1 for
heart disease and 224 lines contain a 0 for heart disease. The ML
estimates and SE values are the same for either type of data file.

Transforming a set of ungrouped data (1 dependent, 1 independent variable) into grouped form would take more than "a line" to include all the information, wouldn't it?

In the following example an (unrealistically!) simple data set is created and a logistic regression model is built.

What would the grouped data actually look like (the variable tab?)? How could the same model be built using grouped data?

> dat = data.frame(y=c(0,1,0,1,0), x=c(1,1,0,0,0))
> dat
  y x
1 0 1
2 1 1
3 0 0
4 1 0
5 0 0
> tab=table(dat)
> tab
   x
y   0 1
  0 2 1
  1 1 1
> mod1=glm(y~x, data=dat, family=binomial())

Best Answer

Table 3.1 is reproduced below:

    Snoring               Heart disease: Yes     No
    Never                                 24   1355
    Occasional                            35    603
    Nearly every night                    21    192
    Every night                           30    224

Agresti considered the following numerical scores for snoring level: {0,2,4,5}.

There are two ways to fit a GLM in R: either the outcome is supplied as a vector of 0/1 values (or a factor with two levels), with the predictors on the right-hand side of the formula; or the left-hand side of the formula is a two-column matrix holding the counts of successes and failures. The latter corresponds to what Agresti calls 'grouped' data. A third method, which also applies to the grouped setting, is to use the weights= argument to indicate how many positive and negative outcomes were observed in each cell of the classification table.
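As a rough sketch of that third, weights-based variant (the data frame `cell` below is my own construction from the counts in Table 3.1, using Agresti's snoring scores):

```r
# One row per cell of the 4x2 classification table; counts enter as weights.
cell <- data.frame(
  x = rep(c(0, 2, 4, 5), 2),                   # Agresti's snoring scores
  y = rep(c(1, 0), each = 4),                  # 1 = heart disease, 0 = none
  n = c(24, 35, 21, 30, 1355, 603, 192, 224)   # cell counts from Table 3.1
)
fit.w <- glm(y ~ x, weights = n, family = binomial, data = cell)
coef(fit.w)  # should match the ungrouped fit (about -3.87 and 0.40)
```

With a 0/1 response and weights=n, each row is treated as n Bernoulli trials that were all successes or all failures, so the likelihood is the same as for the ungrouped data.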

Data in matrix view would read:

snoring <- matrix(c(24,35,21,30,1355,603,192,224), ncol=2)  # col 1 = yes, col 2 = no

From this, we can generate a data.frame in long format (2484 rows = sum(snoring) observations) as follows:

snoring.df <- data.frame(snoring=gl(4, 1, labels=c("Never", "Occasional",
                                                   "Nearly every night", 
                                                   "Every night")),
                         disease=gl(2, 4, labels=c("Yes", "No")),
                         counts=as.vector(snoring))
snoring.df <- snoring.df[rep(seq_len(nrow(snoring.df)), snoring.df$counts), 1:2]
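As a quick sanity check (assuming the expansion above ran as shown), the long-format data frame should have 2484 rows and reproduce the original cell counts:

```r
nrow(snoring.df)                          # 2484
xtabs(~ snoring + disease, data = snoring.df)  # recovers the 4x2 table of counts
```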

And the following two models will yield identical results:

levels(snoring.df$snoring) <- c(0, 2, 4, 5)
y <- abs(as.numeric(snoring.df$disease)-2)
x <- as.numeric(as.character(snoring.df$snoring))
fit.glm1 <- glm(y ~ x, family=binomial)

fit.glm2 <- glm(snoring ~ c(0, 2, 4, 5), family=binomial)

That is, $\text{logit}[\hat\pi(x)]=-3.87+0.40x$, using Agresti's notation.

The second notation is frequently used with aggregated tables, via an instruction like cbind(a, b), where a and b are columns of counts for a binary event (see e.g., Generalized Linear Models). It looks like it would also work when using a table instead of a matrix (as in your example), e.g.

glm(as.table(snoring) ~ c(0, 2, 4, 5), family=binomial)
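For completeness, a minimal sketch of the cbind(a, b) form for the same data (the names yes and no are just illustrative):

```r
yes <- c(24, 35, 21, 30)        # heart-disease counts per snoring level
no  <- c(1355, 603, 192, 224)   # disease-free counts per snoring level
fit.glm3 <- glm(cbind(yes, no) ~ c(0, 2, 4, 5), family = binomial)
coef(fit.glm3)  # same estimates as the fits above
```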