The simplest way to get empirically based confidence intervals is bootstrapping. If you do 999 bootstraps and rank your statistic of interest in ascending order, then the 25th and the 975th values provide a simple estimate of the 95% confidence interval. I don't know of any advantages of cross-validation for this purpose, and you would have to do multiple cross-validations to get that good an estimate in any event. Bootstrapping has the advantage of being as close as you can get to resampling from the original population.
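For instance, a minimal sketch of that percentile approach, assuming a hypothetical function my_stat() that computes your statistic from a data frame d:

```r
set.seed(1)
# my_stat() and d are hypothetical placeholders: a function computing
# your statistic of interest, and your data frame
boots <- replicate(999, my_stat(d[sample(nrow(d), replace = TRUE), ]))
quantile(boots, c(0.025, 0.975))  # equivalently, sort(boots)[c(25, 975)]
```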
So if you already have a program that produces your statistic of interest from a set of bootstrap samples (like validate() from the rms package, which works on the output from the package's cph() function), that's the easiest way to proceed. Functions like validate() know how to keep track of parameter estimates from the original model separately from those based on the bootstrap resamples, which seems to be of some concern to you.*
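A sketch of that workflow, assuming a survival outcome in a hypothetical data frame d (note that cph() needs x=TRUE, y=TRUE so that validate() can resample):

```r
library(rms)
# d is a hypothetical data frame with columns time, status, age, and sex
fit <- cph(Surv(time, status) ~ age + sex, data = d, x = TRUE, y = TRUE)
validate(fit, B = 999)  # bootstrap validation of the model's performance indexes
```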
There are other ways to estimate CIs from bootstraps. Bootstrapping Regression Models in R shows how to use facilities provided by R packages to get such estimates, including the potentially better bias-corrected and accelerated (or $BC_a$) percentile intervals.
The boot package can automate this process in general. First you write a function that produces your statistic of interest from a dataset (the first argument to your function) based on a supplied vector of indices (the second argument). There's a simple example of the structure of such a function on page 194 of ISLR, to augment the somewhat more opaque examples on the boot() manual page. Then the boot() function does the resampling of the cases, passing the vector of chosen indices each time to your function to calculate your statistic. You can use the boot.ci() function on the output from boot() to get 5 different types of CI estimates, including $BC_a$.
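A hedged sketch of that pattern, again with a hypothetical data frame d (here the statistic is a regression coefficient, but any scalar statistic works):

```r
library(boot)
# Statistic function: the data come first, then the index vector boot() supplies
coef_fun <- function(data, indices) {
  coef(lm(y ~ x, data = data[indices, ]))["x"]  # y and x are hypothetical columns
}
out <- boot(d, coef_fun, R = 999)  # 999 case resamples
boot.ci(out, type = "bca")         # BCa interval; other types are available
```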
Finally, with respect to the C-index in particular, pay attention to how you intend to use it. It's not, for example, a very sensitive test for model selection, according to Frank Harrell, who developed that index and is also responsible for the rms package.
*If you have questions about how the calculations are performed, you can get the code for functions written in R by typing the function name at the R prompt, sometimes preceded by the package name. After the rms package is loaded, type rms:::validate.cph to get its code, and predab.resample to see the workhorse function that does most of the work.
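That is:

```r
library(rms)
rms:::validate.cph  # prints the source of the validate() method for cph fits
predab.resample     # prints the resampling workhorse it calls
```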
If PrevDisc and LowStress are binary variables (this is my impression from your description), then the interaction model simply corresponds to four different zero-inflated Poisson distributions: one for each combination of PrevDisc and LowStress.
When using the formula TimeLoss ~ PrevDisc * LowStress, treatment coding of the coefficients is used, i.e., the four parameters (in each model part) are coded as an intercept, two main effects, and an interaction effect. This coding facilitates judging whether or not the interaction effect is significant.
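For example, assuming the model is fit with zeroinfl() from the pscl package (one common choice for zero-inflated Poisson regression) and a hypothetical data frame dat:

```r
library(pscl)
# Treatment coding: intercept, two main effects, and their interaction,
# in both the count part and the zero-inflation part
m_int <- zeroinfl(TimeLoss ~ PrevDisc * LowStress, data = dat)
summary(m_int)
```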
If you want to assess the PrevDisc effect separately for the two LowStress groups, then you can use a nested coding of the coefficients via the formula TimeLoss ~ LowStress/PrevDisc.
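Under the same assumptions as the sketch above:

```r
# Nested coding: one PrevDisc effect within each LowStress group
m_nest <- zeroinfl(TimeLoss ~ LowStress / PrevDisc, data = dat)
summary(m_nest)
```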
In either case, all inference can be done "in the usual way", i.e., summary() and confint() for marginal Wald tests and Wald confidence intervals, but also lrtest() (from lmtest) for nested model comparisons with the likelihood ratio test, or AIC()/BIC(). (Generalized) linear hypotheses can be tested with linearHypothesis() (from car) and glht() (from multcomp), respectively.
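A sketch of those comparisons, continuing the hypothetical fits above:

```r
library(pscl)    # for zeroinfl(), as in the sketches above
library(lmtest)  # for lrtest()

confint(m_int)   # Wald confidence intervals for the interaction model

# Likelihood ratio test of the interaction: main-effects vs interaction model
m_main <- zeroinfl(TimeLoss ~ PrevDisc + LowStress, data = dat)
lrtest(m_main, m_int)

AIC(m_main, m_int)  # information criteria for comparing the two fits
```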
The difference in interpretation between the two model parts is that the count model is a log-linear model for the mean of the count component, while the zero model is log-linear for the odds of zero inflation (i.e., of an observation coming from the point-mass component).
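In symbols, a sketch of the standard parameterization with the default log and logit links (here $x_i$ and $z_i$ denote the regressors of the count and zero parts):

$$\log(\mu_i) = x_i^\top \beta, \qquad \log\frac{\pi_i}{1-\pi_i} = z_i^\top \gamma,$$

where $\mu_i$ is the mean of the Poisson count component and $\pi_i$ is the zero-inflation probability.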
Best Answer
From the Details section of the help, predictdf will generally call stats::predict, which in turn dispatches to the correct predict method for the smoothing method. The other arguments to stat_smooth are also useful to consider.

Most model-fitting functions have a predict method associated with the class of the model. These will usually take a newdata object and an argument se.fit denoting whether standard errors should be returned (see ?predict for further details). Within a call to stat_smooth:

- se is partially matched to se.fit (or se) and passed directly to the predict method to return the appropriate standard errors (method dependent); it will define the interval argument if necessary.
- n defines the newdata values for x at which the predictions will be evaluated; it is used in conjunction with fullrange to define the x values in the newdata object.
- level gives the level of the confidence interval (defaults to 0.95) and is passed directly to the predict method so that the confidence interval can use the appropriate critical value (e.g. predict.lm uses qt((1 - level)/2, df) as the multiplier for the standard errors).
- fullrange determines whether the newdata object, defined within the processing as a sequence of length n, spans the full range of the plot or just the data.
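Putting those arguments together, a minimal usage sketch (dat, x, and y are hypothetical placeholders):

```r
library(ggplot2)
library(MASS)  # provides rlm() and its predict method

# dat is a hypothetical data frame with numeric columns x and y
ggplot(dat, aes(x, y)) +
  geom_point() +
  stat_smooth(method = "rlm", se = TRUE, n = 80,
              level = 0.95, fullrange = TRUE)
```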
In your case, using rlm, this will use predict.rlm, which is defined as sketched below. So it is internally calling predict.lm with an appropriate scaling of the qr decomposition and the scale argument.
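For reference, the body of MASS's predict.rlm is short; a paraphrased sketch of it (check MASS:::predict.rlm in your installed version for the exact code):

```r
# Paraphrased from MASS:::predict.rlm; verify against your installed version
predict.rlm <- function(object, newdata = NULL, scale = NULL, ...) {
  # predict.lm would otherwise use the scale and the QR decomposition
  # computed on the down-weighted values, so both are fixed up here
  object$qr <- qr(sqrt(object$weights) * object$x)
  predict.lm(object, newdata = newdata, scale = object$s, ...)
}
```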