Solved – How to interpret the figure output from package dlnm in R

nonlinear regressionrrelative-risktime series

I am performing distributed non-linear lag models in R.

I got the figure result of dlnm as shown in the vignette (pdf) on page 13:

enter image description here

The X-axis is lag, which I can understand. However, I cannot understand what the label for the Y-axis means. For the figure on page 13, the Y-axis is called "RR" (relative risk) which is positive; however, I got a negative RR when I input my data. My $X$ is temperature, my $Y$ is pollutant concentration. Can I still use the model?

Could anybody explain it to me a little bit?

The R code that I am using is:

OrigData <- read.csv("Hourly data for use SHULE-2.csv", na.strings = ".", 
                     header=T, sep=",")
CO2H6 = OrigData$CO2H6      # $
OutT  = OrigData$OutT       # $
Year  = OrigData$Year       # $
Month = OrigData$Month

argvar <- list(type="ns", df=11)
arglag <- list(type="ns", df=5)

suppressWarnings(cb1 <-crossbasis(OutT, lag=12, argvar=argvar, arglag=arglag))
model1 <- glm(CO2H6~ cb1 +as.factor(Month))
summary(model1)

cp <- crosspred(cb1,model1,-22.83:34.79,by=1, cumul=TRUE)
par(cex.axis=0.7,cex.lab=0.7)
d3 <- plot(cp, xlab="Delay Effects of OutT in IN2H-H6", zlab="CO2 con.", r=90, 
           d=0.3, col="red" , xlab="Out T")

plot(cp, "slices", var=10, ci="n", ylim= c(-1000,1000), ylab="CO2 con.", lwd=1.0, 
     xlim=c(0,10))
for(i in 4:6){
  lines(cp, "slices", var=c(-20,-10,0,10,20,30)[i], col=i+1, lwd=1.5)
}
legend(7,1000, col=5:7, lwd=0.9, pch=1:1, legend=c("OutT =10℃", "OutT =20℃", 
                                                   "OutT =30℃"), cex=0.7)

enter image description here

Since RR cannot be negative, could there be a way for me to transfer my x(Temperature) and y (Pollutant con.) to make it fitting the requirement of the model?

Thanks for all your help!!

Best Answer

Interpretation of the graph in your case

Note: The y-axis is not always the relative risk as in the example given in the vignette of the dlnm package. This is only the case in their example, because they used mortality data and Poisson regression models. In their framework, the exponentiated regression coefficient from the Poisson models $RR=\exp(\hat{\beta})$ is the relative risk. This is analogous to exponentiating the regression coefficients in logistic regression, which is the odds ratio.

Can I still use the model?

Yeah, you can still use such a model.

Let's summarize what you do:

You use natural cubic B-splines as basis functions instead of polynomials to model the relationship between temperature and $\mathrm{CO}_{2}$ (arglag with option type="ns" instead of type="poly")
You assume that the effect of temperature is non-linear, as you specify argvar as splines. One important thing you have to know for the interpretation of the plots is that the function crossbasis automatically centeres the values at the predictor mean (i.e. the mean temperature) if not specified otherwise. This is the reference value with which the predictions are later compared in the graphics.
You consider lags up to 12 (option lag=12 in crossbasis). (Btw: Why do you suppress the warnings?)
You calculate a GLM with Gaussian errors and the identity link function, which is equivalent to a simple linear regression (OLS). You could have used the lm function instead.

The plot that you have provided is interpreted as follows: The x-axis is the lag.

Interpretation of the values of the y-axis: The y-axis depicts the changes in $\mathrm{CO}_{2}$ concentration associated with an increase of 10, 20 or 30°C compared to the mean temperature. If predicted change is 0, this means that an increase in temperature is not associated with an increase in $\mathrm{CO}_{2}$ concentration compared to $\mathrm{CO}_{2}$ concentration at mean temperature: The predicted $\mathrm{CO}_{2}$ concentration is the same at $\bar{x}_{Temp}+z$ degrees (where $z$ is any amount, say 10 or 20 degrees) and at mean temperature $\bar{x}_{Temp}$.

This means that for an increase in temperature of 10°C, the temperature at lag 0 (on the same hour) increases the $\mathrm{CO}_{2}$ concentration compared to the mean temperature. Because you specified cumul=TRUE in crosspred, the effects are cumulative. The cumulative effects of an increase of 10°C are quasi nonexistent after 4 hours compared to the mean temperature. This suggests that the non-cumulative effects are negative at lags 1-4 and null effects from that on.

For temperature increases of 20 or 30°C, the cumulative effects on the $\mathrm{CO}_{2}$ concentration are lower in the first 1-4 hours compared to $\mathrm{CO}_{2}$ at the mean temperature. As with temperature increases of 10°C, the cumulative effects are practically nonexistent after 4 or 5 hours. Again: $\mathrm{CO}_{2}$ concentrations are the same at mean temperature and at an increase in temperature of 20 or 30°C after 4 or 5 hours.

I think a contour plot would be easier to interpret. Try the following code:

plot(cp, xlab="Temperature", col="red", zlab="CO2", shade=0.6,
     main="3D graph of temperature effect")

Interpretation of the example given in the vignette of the dlnm packge

First, a little something about distributed lag models. They have the form:

$$ Y_{i}=\alpha + \sum_{l=0}^{K}\beta_{j}x_{t-l} + \text{other predictors} +\epsilon_{i} $$ where $K$ is the maximum lag and $x$ is a predictor. This is just fitted using a multiple linear regression. So the coefficient $\beta_{1}$ would estimate the effect of $x_{t-1}$ of the day before on $Y_{t}$. In essence, multiple lags of the predictors are included in the model simultaneously. This obviously has the problem that the lagged predictors are highly correlated (autocorrelation).

A more advanced method are polynomial distributed lag models. It has the same basic formula as above, but the impulse-response function is forced to lie on a polynomial of degree $q$ (link to a paper for Stata):

$$ \beta_{i} = a_{0} + a_{1}i + a_{2}i^2 +\ldots+a_{q}i^q $$ where $q$ is the degree of the polynomial and $i$ the lag length. Another formulation is $$ \beta_{i} = a_{0} + \sum_{j=1}^{q}a_{j}f_{j}(i) $$

Where $f_{j}(i)$ is a polynomial of degree $j$ in the lag length $i$. A good introduction to the dlnm package and polynomial distributed lag models can be found here.

These models are often used in studies about air-pollution and health because air-pollution has lagged effects on health outcomes.

Let's look at this graph from the vignette of the dlnm package (page 13):

DLNM poylag

The degree of the polynomials was $q=4$ in this case so the green line is a polynomial of 4th degree. The y-axis is the relative risk (RR) estimated via Poisson regression and the x-axis the considered lag. The relative risk has the following interpretation: Persons who were exposed have a $(RR-1)\cdot100\%$ higher/lower chance of getting the outcome (e.g. death, lung cancer, etc.) compared to people who were not exposed. If $RR>1$ this means a positive association and if $RR<1$ means a protective association. A $RR=1$ means no association. We see that for every increase of $\textrm{PM}_{10}$ by 10 units ($\mu \mathrm{g}/m^{3}$), there is a $(1.001-1)\cdot100\%=0.1\%$ increase in the risk to die at lag 0 (i.e. on the same day as the exposure). Strangely, the exposure from about 9 days ago is protective: an increase of 10$~\mu \mathrm{g}/m^{3}$ is associated with a decreased risk to die compared to people with 10 units less exposure. We can also see that the exposure from 15 days before doesn't play a role (i.e. $RR\approx1$).

Let's look at the cumulative relative risk:

DLNM cumulative

This is the same as before but the effects are cumulated over time (i.e. summing all contributions from the lags up to the maximum lag). The red line starts at the same point as the green line in the first graphic (i.e. $\approx1.001$). We can see that people who have been exposed for five days have an increased cumulative risk of about $(1.005-1)\cdot100\%=0.5\%$ to die compared to non-exposed people. Because the green line goes below the relative risk of $1$ after a lag of about 5 days, the cumulative association after 15 days is nearly $1$. This means that the protective effects of $\textrm{PM}_{10}$ from lag 5 on have compensated the harmful effects from earlier lags. Whether that is scientifically reasonable is another quesiton.

Related Solutions

Solved – R package TSA: how to interpret the IO coefficients output of the arimax function

My answer is quite limited but may help you or others get on the right track. I am suspicious about the TSA io argument in arimax(), because it always gives me the same results as I get when I fit additive outliers. (I fit additive outliers using indicator variables in xreg, as on page 455 of Cryer and Chan's text.) I have made comparisons in multiple settings. It is clear that detectIO and detectAO use different (and I assume correct) algorithms because lambda values differ even when AOs and IOs are both flagged at the same time point. It is also in my experience not possible to fit an AO at time T and simultaneously use io(c(T)), suggesting parameter redundancy. My conclusion is that "io" should be written as "ao". I hope Dr. Chan can either clarify (or correct ??) the io command.

Solved – How to interpret the output of predict.coxph

Edit: the following description applies to survival versions 3.2-8 and below. Starting with version 3.2-9, the default behavior of predict.coxph() changes with respect to treating 0/1 (dummy indicator) variables. See NEWS.

predict.coxph() computes the hazard ratio relative to the sample average for all $p$ predictor variables. Factors are converted to dummy predictors as usual whose average can be calculated. Recall that the Cox PH model is a linear model for the log-hazard $\ln h(t)$:

$$ \ln h(t) = \ln h_{0}(t) + \beta_{1} X_{1} + \dots + \beta_{p} X_{p} = \ln h_{0}(t) + \bf{X} \bf{\beta} $$

Where $h_{0}(t)$ is the unspecified baseline hazard. Equivalently, the hazard $h(t)$ is modeled as $h(t) = h_{0}(t) \cdot e^{\beta_{1} X_{1} + \dots + \beta_{p} X_{p}} = h_{0}(t) \cdot e^{\bf{X} \bf{\beta}}$. The hazard ratio between two persons $i$ and $i'$ with predictor values $\bf{X}_{i}$ and $\bf{X}_{i'}$ is thus independent of the baseline hazard and independent of time $t$:

$$ \frac{h_{i}(t)}{h_{i'}(t)} = \frac{h_{0}(t) \cdot e^{\bf{X}_{i} \bf{\beta}}}{h_{0}(t) \cdot e^{\bf{X}_{i'} \bf{\beta}}} = \frac{e^{\bf{X}_{i} \bf{\beta}}}{e^{\bf{X}_{i'} \bf{\beta}}} $$

For the estimated hazard ratio between persons $i$ and $i'$, we just plug in the coefficient estimates $b_{1}, \ldots, b_{p}$ for the $\beta_{1}, \ldots, \beta_{p}$, giving $e^{\bf{X}_{i} \bf{b}}$ and $e^{\bf{X}_{i'} \bf{b}}$.

As an example in R, I use the data from John Fox' appendix on the Cox-PH model which provides a very nice introductory text. First, we fetch the data and build a simple Cox-PH model for the time-to-arrest of released prisoners (fin: factor - received financial aid with dummy coding "no" -> 0, "yes" -> 1, age: age at the time of release, prio: number of prior convictions):

> URL   <- "https://socialsciences.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
> Rossi <- read.table(URL, header=TRUE)                  # our data
> Rossi[1:3, c("week", "arrest", "fin", "age", "prio")]  # looks like this
  week arrest fin age prio
1   20      1  no  27    3
2   17      1  no  18    8
3   25      1  no  19   13

> library(survival)                                      # for coxph()    
> fitCPH <- coxph(Surv(week, arrest) ~ fin + age + prio, data=Rossi)    # Cox-PH model
> (coefCPH <- coef(fitCPH))                              # estimated coefficients
     finyes         age        prio 
-0.34695446 -0.06710533  0.09689320

Now we plug in the sample averages for our predictors into the $e^{\bf{X} \bf{b}}$ formula:

meanFin  <- mean(as.numeric(Rossi$fin) - 1)   # average of financial aid dummy
meanAge  <- mean(Rossi$age)                   # average age
meanPrio <- mean(Rossi$prio)                  # average number of prior convictions
rMean <- exp(coefCPH["finyes"]*meanFin        # e^Xb
           + coefCPH["age"]   *meanAge
           + coefCPH["prio"]  *meanPrio)

Now we plug in the predictor values of the first 4 persons into the $e^{\bf{X} \bf{b}}$ formula.

r1234 <- exp(coefCPH["finyes"]*(as.numeric(Rossi[1:4, "fin"])-1)
           + coefCPH["age"]   *Rossi[1:4, "age"]
           + coefCPH["prio"]  *Rossi[1:4, "prio"])

Now calculate the relative risk for the first 4 persons against the sample average and compare to the output from predict.coxph().

> r1234 / rMean
[1] 1.0139038 3.0108488 4.5703176 0.7722002

> relRisk <- predict(fitCPH, Rossi, type="risk")   # relative risk
> relRisk[1:4]
        1         2         3         4 
1.0139038 3.0108488 4.5703176 0.7722002

If you have a stratified model, the comparison in predict.coxph() is against the strata-averages, this can be controlled via the reference option that is explained in the help page.