As time normally *increases*, I would rewrite this as:

```
(d1 <- data.frame(t1=seq(5),
                  de1=c(0, 1195, 1237, 1251, 1257)))
  t1  de1
1  1    0
2  2 1195
3  3 1237
4  4 1251
5  5 1257
```

There is no censoring until the final time point.

We can ignore the first time point as there was no death or censoring there.

For the last four time points we can get the number that died as

```
(diff(d1$de1))
[1] 1195 42 14 6
```

We also know from the data you provide that $141$ subjects, i.e. `1398 - sum(diff(d1$de1))`, were still alive and so censored at the final time.

Thus:

```
library(survival)
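# first argument: event/censoring time for each subject
# second argument: status (1 = died, 0 = censored)
s1 <- Surv(c(rep(2, 1195), rep(3, 42), rep(4, 14), rep(5, 6), rep(5, 141)),
           c(rep(1, 1195), rep(1, 42), rep(1, 14), rep(1, 6), rep(0, 141))
           )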
```

which we can use as normal, e.g.

```
(survfit(s1 ~ 1))
Call: survfit(formula = s1 ~ 1)
records n.max n.start events median 0.95LCL 0.95UCL
1398 1398 1398 1257 2 2 2
```

Given the example, this is a rather 'concrete' solution. If it needs to be generalized further, then `paste`/`apply` should be helpful.
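As one possible generalization, the time and status vectors can be built directly from the cumulative death counts. This is only a sketch: it uses `rep(..., times = ...)` rather than `paste`/`apply`, and the names `n_total`, `deaths`, `death_times`, `n_censored`, `time` and `status` are introduced here for illustration.

```r
# Cumulative deaths at each time point (same data as above)
d1 <- data.frame(t1 = seq(5), de1 = c(0, 1195, 1237, 1251, 1257))
n_total <- 1398  # cohort size, taken from the question's data

deaths      <- diff(d1$de1)            # deaths recorded at times 2..5
death_times <- d1$t1[-1]               # the time points those deaths belong to
n_censored  <- n_total - sum(deaths)   # subjects still alive at the final time

# One entry per subject: all deaths first, then the censored subjects
time   <- c(rep(death_times, times = deaths), rep(max(d1$t1), n_censored))
status <- c(rep(1, sum(deaths)), rep(0, n_censored))

# Build the survival object only if the survival package is available
if (requireNamespace("survival", quietly = TRUE)) {
  s1 <- survival::Surv(time, status)
}
```

The same code should work for any number of time points, as long as `de1` holds cumulative deaths and all censoring happens at the last time.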

A rate has a specific definition of $\frac{\# \mbox{events}}{\# \mbox{person-years}}$. A risk, on the other hand, refers to a particular individual's risk of experiencing an outcome of interest, and it is risk which is intrinsically related to the hazard (instantaneous risk). The language the question uses is consistent with this understanding. If I had to change it, I would say, "The death rate for *smokers* is twice that of *non-smokers*". They also failed to mention whether these were age-adjusted rates or not.

To understand this a little more deeply, *relative rates* and *relative risks* are estimated with fundamentally different models.

If you wanted to formalize a rate, you can think of this as estimating:

$$E \left( \frac{\# \mbox{events}}{\# \mbox{person-years}} \right) =\frac{\sum_i Pr(Y_i < t_i)} {\sum_i t_i} $$

($Y_i$ is the death time and $t_i$ is the observation time for the $i$-th individual; note that the times are considered fixed and not random!)

You'll recognize the numerator is a bunch of CDFs, or 1-survival functions, and the relationship with survival functions and hazards is well known.
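To make the formula concrete, here is a small numerical sketch. It assumes exponential survival with a constant hazard, which the argument above does not require; `lambda` and `t_obs` are made-up illustrative values.

```r
# Hypothetical inputs: a constant hazard and fixed observation times (in years)
lambda <- 0.02
t_obs  <- c(1, 2, 5, 10, 10)

# Pr(Y_i < t_i) is the CDF, i.e. 1 - S(t_i); exponential survival is assumed here
p_death <- 1 - exp(-lambda * t_obs)   # equivalent to pexp(t_obs, rate = lambda)

# Expected number of events divided by total observation time
expected_rate <- sum(p_death) / sum(t_obs)
expected_rate
```

Under this assumption the result is slightly below `lambda`, because $1 - e^{-\lambda t} < \lambda t$ for every $t > 0$.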

So if you took a ratio of rates (indexing non-smokers by $i$ and smokers by $j$):

$$ 2 = E \left( \frac{ \# \mbox{smoker deaths} \times \# \mbox{non-smoker person-years}}{\# \mbox{non-smoker deaths} \times \# \mbox{smoker person-years}} \right) = \frac{\sum_i t_i}{\sum_j t_j} \frac{\sum_j Pr(Y_j < t_j)}{\sum_i Pr(Y_i < t_i)}$$

$$ = \frac{\sum_i t_i}{\sum_j t_j} \frac{ n_j-\sum_jS(t_j)}{n_i-\sum_iS(t_i)}$$

Since it's self study, you should probably do the algebra and solve the remainder of the equation!

## Best Answer

Combining proportions dying as you do does not give you the cumulative hazard. The hazard rate in continuous time is the conditional probability, per unit of time, that the event happens during a very short interval, given that it has not happened before:

$$h(t) = \lim_{\Delta t \rightarrow 0} \frac {P(t<T \le t + \Delta t | T >t)} {\Delta t}$$

Cumulative hazard is the integral of the (instantaneous) hazard rate over age/time. It's like summing up probabilities, but since $\Delta t$ is very small, these probabilities are also small numbers (e.g. the hazard rate of dying may be around 0.004 at ages around 30). The hazard rate is conditional on not having experienced the event before $t$, so for a population the cumulative hazard may exceed 1.
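A quick sketch of that last point, assuming a made-up constant hazard of 0.5 per year over 4 years: the cumulative hazard exceeds 1 while the survival probability stays between 0 and 1.

```r
# Hypothetical constant hazard of 0.5 per year (vectorized so integrate() can use it)
h <- function(t) rep(0.5, length(t))

# Cumulative hazard H(t): the integral of h from 0 to t, here t = 4
H <- integrate(h, lower = 0, upper = 4)$value

# Survival follows from the cumulative hazard: S(t) = exp(-H(t))
S <- exp(-H)
c(cumulative_hazard = H, survival = S)
```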

You may look up a human mortality life table (although this is a discrete-time formulation) and try to accumulate the age-specific death rates $m_x$.

If you use R, here's a little example of approximating these functions from number of deaths at each 1-year age interval:
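A minimal sketch of such an approximation might look like the following. The death counts and cohort size are made up for illustration and are not real life-table data; the names `n0`, `deaths`, `at_risk`, `q`, `H_sum` and `H_log` are likewise introduced here.

```r
# Hypothetical data: a cohort of 1000 followed over five 1-year age intervals
n0     <- 1000
deaths <- c(10, 20, 40, 80, 160)

# Number still at risk at the start of each interval
at_risk <- n0 - c(0, cumsum(deaths)[-length(deaths)])

# Discrete hazard: probability of dying in the interval given survival to its start
q <- deaths / at_risk

# Survival to the end of each interval, and two approximations to cumulative hazard
S     <- cumprod(1 - q)
H_sum <- cumsum(q)    # sum of the interval hazards
H_log <- -log(S)      # from the identity S = exp(-H)

data.frame(interval = seq_along(deaths), at_risk, deaths, q, S, H_sum, H_log)
```

The two cumulative hazard columns agree closely while the interval hazards are small and drift apart as they grow.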

Hope this helps.