Odds Ratio Interpretation – Logistic Regression vs Simple Methods

logistic, odds-ratio, regression

I am trying to understand the interpretation of an "Odds Ratio calculated the Simple Way" vs the interpretation of an "Odds Ratio calculated from a Logistic Regression".

Suppose I have a (sample) dataset that contains medical information on some patients and whether or not they have a certain disease:

 gender age   weight disease patient_id
      m  42 73.89227       y          1
      m  39 78.01266       n          2
      m  42 84.91308       y          3
      f  49 95.78418       n          4
      m  63 80.91756       n          5
      f  42 71.42108       n          6

(Simple Way) Approach 1: Based on the above dataset, suppose I were to summarize the number of male patients and female patients who have the disease and don't have the disease in a contingency table:

$$
\begin{array}{c|c|c}
& \text{Disease} & \text{No Disease} \\\hline
\text{Male} & a & b \\\hline
\text{Female} & c & d
\end{array}
$$

Based on this information, I can calculate the Odds Ratio for Gender as:

$$
\text{Odds Ratio}_{\text{gender}} = \frac{a \times d}{b \times c} = x
$$

To me, this means that if all other variables (e.g. age, weight) are not taken into consideration and the dataset is representative of the population, then if we were to pick a random man and a random woman from the population, the ratio of the man's odds of having this disease to the woman's odds is $x:1$. In other words, the odds of having the disease are higher by a factor of $x$ for a male compared to a female. (Note: this is not the same as the Relative Risk, which is the ratio of the probability, rather than the odds, of having the disease for a male vs a female.)
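As a minimal sketch of Approach 1, the snippet below computes the odds for each gender, their ratio, and (for contrast) the relative risk as a ratio of probabilities. The cell counts `a, b, c, d` are hypothetical numbers chosen for illustration; they are not taken from the dataset above.

```python
# Hypothetical 2x2 counts (illustrative only, not from the dataset above):
#            disease   no disease
# male          a          b
# female        c          d
a, b, c, d = 110, 190, 40, 200

# Odds of disease within each gender.
odds_male = a / b
odds_female = c / d

# Odds ratio: (a/b) / (c/d), which is algebraically the same as (a*d) / (b*c).
x = odds_male / odds_female

# Relative risk: ratio of disease probabilities, not of odds.
risk_male = a / (a + b)
risk_female = c / (c + d)
rr = risk_male / risk_female

print(f"odds (male)   = {odds_male:.3f}")
print(f"odds (female) = {odds_female:.3f}")
print(f"odds ratio x  = {x:.3f}")
print(f"relative risk = {rr:.3f}")
```

With these made-up counts the odds ratio (about 2.89) is noticeably larger than the relative risk (2.2); the two only come close when the disease is rare in both groups.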

(Logistic Regression) Approach 2: Based on the above dataset, I could also fit a Logistic Regression Model to this data and calculate the Odds Ratio for Gender as:

$$
\log\left(\frac{\text{P(disease = yes)}}{1 - \text{P(disease = yes)}}\right) = \text{logit}(\text{P(disease = yes)}) = \beta_0 + \beta_1 \times \text{age} + \beta_2 \times \text{gender} + \beta_3 \times \text{weight}
$$

$$
\text{Odds Ratio}_{\text{gender}} = \frac{\exp(\beta_0 + \beta_1 \times \text{age} + \beta_2 \times (\text{gender} = \text{Male} = 1) + \beta_3 \times \text{weight})}{\exp(\beta_0 + \beta_1 \times \text{age} + \beta_2 \times (\text{gender} = \text{Female} = 0) + \beta_3 \times \text{weight})} = e^{\beta_2} = z
$$

To me, this means that if all other variables are held fixed (i.e. also not taken into consideration) and the sample is considered representative of the population, the odds of having the disease are higher by a factor of $z$ for a male compared to a female.
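To make Approach 2 concrete, here is a rough sketch that fits a logistic regression by Newton-Raphson using only the Python standard library and reports $e^{\beta_2}$ for gender. Several things here are assumptions made for illustration: the grouped counts are hypothetical, the single binary covariate `old` stands in for the continuous age and weight variables, and the helper names `solve` and `fit_logistic` are invented for this sketch. In practice one would normally use a statistics package instead.

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_logistic(cells, n_iter=25):
    """Maximum-likelihood fit for grouped binary data via Newton-Raphson.

    cells: list of (x, cases, total), where x is the covariate tuple
    including a leading 1.0 for the intercept.
    """
    k = len(cells[0][0])
    beta = [0.0] * k
    for _ in range(n_iter):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, cases, total in cells:
            eta = sum(bj * xj for bj, xj in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for i in range(k):
                score[i] += x[i] * (cases - total * p)
                for j in range(k):
                    info[i][j] += total * p * (1.0 - p) * x[i] * x[j]
        step = solve(info, score)
        beta = [bj + sj for bj, sj in zip(beta, step)]
    return beta

# Hypothetical grouped data: (male, old, cases, total). Illustrative only.
groups = [
    (1, 0, 10, 100),   # young men
    (0, 0, 20, 200),   # young women
    (1, 1, 100, 200),  # old men
    (0, 1, 20, 40),    # old women
]

# Model with gender only: logit(p) = b0 + b_gender * male
cells_gender = [((1.0, float(m)), cases, total) for m, _, cases, total in groups]
or_crude = math.exp(fit_logistic(cells_gender)[1])

# Model with gender and age group: logit(p) = b0 + b_gender * male + b_old * old
cells_full = [((1.0, float(m), float(o)), cases, total) for m, o, cases, total in groups]
or_adjusted = math.exp(fit_logistic(cells_full)[1])

print(f"exp(beta_gender), gender-only model: {or_crude:.3f}")
print(f"exp(beta_gender), gender + age model: {or_adjusted:.3f}")
```

In this constructed example the gender-only model reproduces the contingency-table value $(a \times d)/(b \times c)$ from the collapsed counts (about 2.89), while the model that also includes the age group gives an odds ratio of about 1, because within each age group men and women have the same disease rate.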

My Question: For the same data, I can clearly calculate an Odds Ratio using a "simple approach" and using a more complicated approach based on "Logistic Regression" – yet in both approaches, I am still estimating the same quantity (i.e. increase odds relative to some change in variable when all other variables are equal). What are the advantages of using either approach?

In my opinion, the advantage is that the Odds Ratio calculated using Logistic Regression is "adjusted" to take into account the influence of other variables – whereas the Odds Ratio calculated using the simple way does not take into account the influence of other variables.

For example, suppose that when I fit a Logistic Regression model to this data, I find that the "age" variable contributes a lot more (e.g. size of the "age" regression coefficient, p-value) to the probability of having the disease than the "gender" variable does. As a result, the size of the $\beta_1$ coefficient is likely to be greater than that of $\beta_2$. Thus, if I do decide to estimate the increase in the Odds of having the disease for a man vs a woman, the value of this Odds Ratio will be toned down, seeing as the influence of gender is not that important.

On the other hand, when I interpret the Odds Ratio for Gender using the "simple approach", I am not taking into consideration the contributions of other variables, so I might end up overestimating or underestimating the effect of Gender on the odds of developing the disease. This is akin to Omitted Variable Bias or Confounding. To me, this sounds like the following example: suppose I frequently attended Chicago Bulls basketball games, and someone counted the number of times the Chicago Bulls won vs lost when I was at the game vs not at the game. If the Chicago Bulls win a lot of games in general, someone could falsely conclude that my presence increases the odds of them winning! However, someone could add another variable to the analysis, such as "whether Michael Jordan was playing". Now they would see that when the contributions of both of us are jointly taken into consideration, the odds of the Chicago Bulls winning increase very little based on my presence alone!

Can someone please tell me if my understanding of the above concepts is correct?

Thanks!

Best Answer

Your take seems to be correct except for this bit:

> yet in both approaches, I am still estimating the same quantity (i.e. increase odds relative to some change in variable when all other variables are equal).

In the "simple" approach, you do not condition on variables other than gender (and thus you do not ensure the other variables stay fixed when going from one gender to the other; there may be confounding), while in the logistic regression approach, you do. No wonder the two approaches yield different results.
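The difference between not conditioning and conditioning can be illustrated without any regression at all. The sketch below uses hypothetical counts, split by an assumed binary age group, and compares the odds ratio from the collapsed table with the odds ratios inside each age group. It also computes the Mantel-Haenszel pooled odds ratio, which is a classical stratified estimator and not the logistic-regression estimate itself, although both aim at a gender effect with age held fixed.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (a*d)/(b*c) for a 2x2 table with rows male/female
    and columns disease/no disease."""
    return (a * d) / (b * c)

# Hypothetical counts per age stratum, as (a, b, c, d). Illustrative only.
strata = {
    "young": (10, 90, 20, 180),
    "old": (100, 100, 20, 20),
}

# "Simple" approach: ignore age by summing the tables cell by cell.
a, b, c, d = (sum(t[i] for t in strata.values()) for i in range(4))
crude = odds_ratio(a, b, c, d)

# Conditioning on age: one odds ratio per stratum.
stratum_ors = {name: odds_ratio(*t) for name, t in strata.items()}

# Mantel-Haenszel pooled odds ratio across strata:
# sum(a_i * d_i / n_i) / sum(b_i * c_i / n_i)
num = sum(ai * di / (ai + bi + ci + di) for ai, bi, ci, di in strata.values())
den = sum(bi * ci / (ai + bi + ci + di) for ai, bi, ci, di in strata.values())
mh = num / den

print(f"crude odds ratio (age ignored): {crude:.3f}")
for name, value in stratum_ors.items():
    print(f"odds ratio within '{name}' stratum: {value:.3f}")
print(f"Mantel-Haenszel odds ratio: {mh:.3f}")
```

Here men are over-represented in the older, higher-risk group, so the collapsed table gives an odds ratio of about 2.89 even though the odds ratio is 1 within each age group. That gap is the confounding described above.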
