Solved – Logistic regression and the 2 by 2, or 3 by 2 contingency table


I have a question about logistic regression and the 2 by 2 (or 3 by 2, or more generally n by 2) contingency table; an example of such a table can be found here:

http://en.wikipedia.org/wiki/Contingency_table

My question is when I have a table like this:

      Right-handed   Left-handed    Total
Male            43             9       52 
Female          44             4       48 
Total           87            13      100

if I want to know whether gender is associated with being right-handed or left-handed, I can calculate the odds ratio as an indication of whether males are more likely to be left-handed:

(9/52) / (4/48) = 2.08

Then I can also use the Pearson chi-square test to see whether there is a significant association (say, p-value < 0.05), right?

But I can also create an indicator variable in a data set of 100 people, 52 male and 48 female, with 1 = left-handed and 0 = right-handed, and then get the odds ratio from the output of a logistic regression.
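That indicator-variable construction can be sketched in Python (my own illustration, not part of the original question): expand the 2 by 2 table into an individual-level data set and check that the cell counts are recovered.

```python
# Expand the 2x2 table into an individual-level dataset of 100 people,
# with an indicator: 1 = left-handed, 0 = right-handed.
rows = ([("male", 1)] * 9 + [("male", 0)] * 43 +
        [("female", 1)] * 4 + [("female", 0)] * 44)

n_male = sum(1 for sex, _ in rows if sex == "male")
n_female = sum(1 for sex, _ in rows if sex == "female")
left_male = sum(lh for sex, lh in rows if sex == "male")
left_female = sum(lh for sex, lh in rows if sex == "female")

print(n_male, n_female, left_male, left_female)  # 52 48 9 4
```

This individual-level layout is exactly what a logistic regression routine expects as input.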

So my question is: which is the right way to find out which group is more likely to be left-handed?

Also, what are the differences between the two approaches? Are they answering the same question?

1) Basically, what is the difference between the two approaches (what kind of question is each approach designed to answer)?

2) What is the difference between the p-value from the 2-by-2 contingency table test and the p-value from the significance test of the gender (M/F) parameter estimate in the logistic regression?

Could someone kindly explain?

Best Answer

The odds ratio in that table is:

(9/43)/(4/44) = 2.30

What John_w computed was a risk ratio.
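The two quantities can be compared directly from the cell counts; here is a short Python sketch (my addition, not part of the original answer):

```python
# 2x2 table: rows = sex, columns = handedness
left_m, right_m = 9, 43    # males:   9 left-handed, 43 right-handed (52 total)
left_f, right_f = 4, 44    # females: 4 left-handed, 44 right-handed (48 total)

# Risk ratio: compares the *proportions* left-handed in each group
rr = (left_m / (left_m + right_m)) / (left_f / (left_f + right_f))

# Odds ratio: compares the *odds* of left-handedness in each group
odds_ratio = (left_m / right_m) / (left_f / right_f)

print(round(rr, 2))          # 2.08 -- what the question computed
print(round(odds_ratio, 2))  # 2.3  -- what logistic regression estimates
```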

As the output below shows, the manually computed odds ratio is exactly the same as the one produced by logistic regression. Both tests you suggested test the null hypothesis that this odds ratio is equal to 1, and their p-values should be close enough that the difference does not matter. For example, doing this in Stata gives the following output:

. // prepare the data
. clear

. input female right freq

        female      right       freq
  1.         0          1         43
  2.         0          0          9
  3.         1          1         44
  4.         1          0          4
  5. end

. label define female 0 "male" 1 "female"

. label value female female

. label variable female "respondent's sex"

.
. label define right 0 "left-handed" 1 "right-handed"

. label value right right

. label variable right "respondent's handedness"

.
. // tabulate
. tab female right [fw=freq], lr chi2

           |     respondent's
respondent |      handedness
    's sex | left-hand  right-han |     Total
-----------+----------------------+----------
      male |         9         43 |        52
    female |         4         44 |        48
-----------+----------------------+----------
     Total |        13         87 |       100

          Pearson chi2(1) =   1.7774   Pr = 0.182
 likelihood-ratio chi2(1) =   1.8250   Pr = 0.177

.
. // the odds ratio:
. di (9/43)/(4/44)
2.3023256

.
. // logistic regression
. logit right female [fw=freq], or nolog

Logistic regression                               Number of obs   =        100
                                                  LR chi2(1)      =       1.82
                                                  Prob > chi2     =     0.1767
Log likelihood = -37.726174                       Pseudo R2       =     0.0236

------------------------------------------------------------------------------
       right | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   2.302326   1.468974     1.31   0.191     .6592751    8.040199
       _cons |   4.777778   1.751347     4.27   0.000      2.32921    9.800386
------------------------------------------------------------------------------
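Those estimates can be recovered by hand from the cell counts; a quick Python check (my addition, not from the original answer):

```python
import math

left_m, right_m = 9, 43     # males
left_f, right_f = 4, 44     # females

# _cons: odds of right-handedness in the baseline group (males)
cons_or = right_m / left_m                             # 43/9

# exp(coefficient) for female: ratio of the two groups' odds of right-handedness
female_or = (right_f / left_f) / (right_m / left_m)

# log likelihood at the maximum (each group fitted by its own proportion)
ll = (left_m * math.log(left_m / 52) + right_m * math.log(right_m / 52) +
      left_f * math.log(left_f / 48) + right_f * math.log(right_f / 48))

print(round(cons_or, 4), round(female_or, 4), round(ll, 3))
# 4.7778 2.3023 -37.726
```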

For fun I also asked for the likelihood-ratio chi-square statistic after tab, to show that it is exactly the same as the one reported by the logistic regression (labeled LR chi2(1) and Prob > chi2 in the output of logit).
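Both statistics in the tab output can also be reconstructed from the observed counts and the expected counts under independence (again a Python sketch of mine, not from the original answer):

```python
import math

obs = {("male", "left"): 9,   ("male", "right"): 43,
       ("female", "left"): 4, ("female", "right"): 44}

n = sum(obs.values())
row = {"male": 52, "female": 48}     # row totals
col = {"left": 13, "right": 87}      # column totals

# Expected counts under independence: row total * column total / n
exp = {(r, c): row[r] * col[c] / n for r in row for c in col}

# Pearson chi-square: sum of (O - E)^2 / E over the four cells
pearson = sum((obs[k] - exp[k]) ** 2 / exp[k] for k in obs)

# Likelihood-ratio (G) statistic: 2 * sum of O * ln(O / E)
lr = 2 * sum(obs[k] * math.log(obs[k] / exp[k]) for k in obs)

print(round(pearson, 4))  # 1.7774
print(round(lr, 3))       # 1.825
```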
