Regression – How to Choose Theta Thresholds for OCAT Family in MGCV Package (Ordinal Regression)

generalized-additive-modelmgcvordered-logitratioregression

I have a dependent variable named "FGR" which represents a ratio and takes values between 0 and +Inf. The histogram plot below illustrates the distribution of FGR, excluding +Infinite values

enter image description here

To categorize the FGR variable, I created a new ordinal variable called "FGR_cat" with three categories:

fit1 (<0.30)
fit2 (>= 0.30 & < 0.80)
fit3 (>= 0.80)
Next, I conducted a generalized additive model using the ocat family (ordinal logistic regression). The model formula is as follows:


gam_model <- mgcv::gam(FGR_cat ~ s(Hormone) + Age + BMI + as.factor(PCOS) + s(Patient, bs = "re")
                       , data = my_data,
    ,method = "REML", family = ocat(theta = 1))

The model results for the effect of hormone are as follows:

Family: Ordered Categorical(-1,0)

Link function: identity

edf Ref.df Chi.sq p-value

s(Hormone) 1.0000083 1 4.12 0.0424

Deviance explained = 2.06%

-REML = 419.06 Scale est. = 1 n = 351 AIC value: 829.7235

The stacked area chart below shows the predicted probabilities of fit1, fit2, and fit3 along the hormone axis:

enter image description here

However, when I set the threshold for the linear predictor to 5 (theta = 5) using the following code:

,method = "REML", family = ocat(theta = 5))

The updated results are as follows:

Family: Ordered Categorical(-1,4) Link function: identity

edf Ref.df Chi.sq p-value

s(Hormone) 1.0 1 6.911 0.00857 **

Deviance explained = 71.4%

-REML = 359.13 Scale est. = 1 n = 351 AIC value = 646.0894

enter image description here

As the value of theta increases, the significance of the effect of Hormone becomes more pronounced, and the AIC consistently decreases. Eventually when theta reaches 10 or higher, no predicted probability for fit3 is observed at any level of Hormone.

By adjusting the value of the theta threshold, it is possible to obtain a wide range of predicted probabilities. However, determining the appropriate threshold values for the linear predictor that can be confidently reported in scientific studies is crucial. What is the most accurate approach for defining these threshold values? Alternatively, if you have any alternative methods for modeling the FGR ratio, please provide an explanation. Thank you

EDIT after the comment of @Paul: In the explanation of theta it is written:
"cut point parameter vector (dimension R-2). If supplied and all positive, then taken to be the cut point increments (first cut point is fixed at -1). If any are negative then absolute values are taken as starting values for cutpoint increments"

I could not understand the meaning of last sentence. When i choose theta as positive, such as 3, it makes an increment in the value of first cutpoint -1 by 3 and the result is -1 + 3 = 2 as follows;

Family: Ordered Categorical(-1,2)

When i choose the theta as a negative value, it defines the cutoff values as 1.62 as follows;

Family: Ordered Categorical(-1,1.62)

How did it choose the threshold of 1.62?

Best Answer

Unless you have very strong prior knowledge, it is better to estimate the threshold values, $\boldsymbol{\theta}$. In that case you must specify the number of categories, $R$, via argument R to the ocat() family, and leave the theta argument set at its default NULL value.

By way of example, here is the model I use for testing in {gratia}:

library("gratia")
library("mgcv")

# Ordered categorical model ocat()
n_categories <- 4                                            # <--- How many categories

# simulate data
su_eg1_ocat <- data_sim("eg1", n = 200, dist = "ordered categorical",
  n_cat = n_categories, seed = 42)

# fit model to these data
m_ocat <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
  family = ocat(R = n_categories),           # <--- specify R, number of categories here
  data = su_eg1_ocat, method = "REML")
```
Related Question