Solved – SIR model parameter estimation in R

epidemiology, point-estimation, r

I searched Google for the SIR model in R and came up with the following code.

Infected <- c(1,3,4,7,7,7,7,9,31,45,66,73,84,89,99,117,190,217,319,340,368,399,439,466,498,590,649,694,767,824,886,966,1156)

SIR <- function(time, state, parameters) {
  par <- as.list(c(state, parameters))
  with(par, {
    dS <- -beta/N * I * S
    dI <- beta/N * I * S - gamma * I
    dR <- gamma * I
    list(c(dS, dI, dR))
  })
}

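library(deSolve)
# N (total population size) must be defined before this point; its value is not given here.
Day <- seq_along(Infected)  # observation times 1, 2, ..., length(Infected)
init <- c(S = N-Infected[1], I = Infected[1], R = 0)
RSS <- function(parameters) {
  names(parameters) <- c("beta", "gamma")
  out <- ode(y = init, times = Day, func = SIR, parms = parameters)
  fit <- out[ , 3]  # column 3 is I(t); the columns are time, S, I, R
  sum((Infected - fit)^2)  # residual sum of squares over all observations
}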


Opt <- optim(c(0.5, 0.5), RSS, method = "L-BFGS-B", lower = c(0, 0), upper = c(1, 1)) 

Opt_par <- setNames(Opt$par, c("beta", "gamma"))
Opt_par


t <- 1:190 # time in days



fit <- data.frame(ode(y = init, times = t, func = SIR, parms = Opt_par))

In this code, we want to estimate beta and gamma and then solve the ODE with these values.

My question is the infected and recovered data are not used for the estimation of the beta and gamma except the first value of infected.
Wouldn't it be better to include all of the infected data when optimizing beta and gamma?

Best Answer

The statement

My question is the infected and recovered data are not used for the estimation of the beta and gamma except the first value of infected.

is incorrect. In the loss function (squared loss) definition, the code uses all infected data to determine the best $\beta$ and $\gamma$.

sum((Infected - fit)^2)

From a fitting perspective, this should be fine, because we have very few parameters to fit (in this case, just two) and a lot of data (the number of infections per day). It is more like an overdetermined system.
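To make this concrete, here is a minimal sketch that exposes the residuals behind the loss. It assumes one observation per day and uses a hypothetical population size N = 1e6, since the question does not give N. The helper name residuals_sir and the trial values in theta are also illustrative, not part of the original code.

```r
library(deSolve)

Infected <- c(1,3,4,7,7,7,7,9,31,45,66,73,84,89,99,117,190,217,319,340,368,
              399,439,466,498,590,649,694,767,824,886,966,1156)
Day <- seq_along(Infected)  # assumed: one observation per day
N   <- 1e6                  # hypothetical population size, not from the question

SIR <- function(time, state, parameters) {
  with(as.list(c(state, parameters)), {
    dS <- -beta / N * I * S
    dI <-  beta / N * I * S - gamma * I
    dR <-  gamma * I
    list(c(dS, dI, dR))
  })
}

init <- c(S = N - Infected[1], I = Infected[1], R = 0)

# One residual per observed day: observed count minus the model's I(t)
residuals_sir <- function(parameters, observed) {
  names(parameters) <- c("beta", "gamma")
  out <- ode(y = init, times = Day, func = SIR, parms = parameters)
  observed - out[, "I"]
}

RSS <- function(parameters, observed = Infected) {
  sum(residuals_sir(parameters, observed)^2)
}

theta <- c(0.5, 0.3)  # arbitrary trial values for (beta, gamma)
res <- residuals_sir(theta, Infected)
length(res)           # as many residuals as observations, but only 2 parameters

# Changing only the last observation changes the loss,
# so the later data points do influence the fit.
Infected_alt <- Infected
Infected_alt[length(Infected_alt)] <- 2000
rss_orig <- RSS(theta)
rss_alt  <- RSS(theta, Infected_alt)
c(original = rss_orig, altered = rss_alt)
```

Only the initial condition comes from Infected[1] alone; every other observation enters through the sum of squared residuals.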

Related Solutions

Solved – Maximum Likelihood Estimate of Infection Model Parameters

Here's a possibility in which the model is modified (1) to be explicitly probabilistic, and (2) to take place in discrete time.

The code below explains the modified model, simulates it, and then uses MLE to recover the parameters (whose true value is known in this toy example, since we simulated the data). Careful: my beta will not be exactly equivalent to your beta -- see "story" in the comments below.

library(ggplot2)
library(reshape2)

## S(t) susceptible, I(t) infected, R(t) recovered at time t
## Probabilistic model in discrete time:
## S(t+1) = S(t) - DeltaS(t)
## I(t+1) = I(t) + DeltaS(t) - DeltaR(t)
## R(t+1) = R(t) + DeltaR(t)
## DeltaR(t) ~ Binomial(I(t), gamma) >= 0
## DeltaS(t) ~ Binomial(S(t), 1 - (1 - beta)^I(t)) >= 0
## Story: each infected has probability gamma of recovering during the period;
## before recoveries are realized, each susceptible interacts with each infected;
## each interaction leads to infection with probability beta;
## susceptible becomes infected if >= 1 of her interactions leads to infection

simulate <- function(T=100, S1=100, I1=10, R1=0, beta=0.005, gamma=0.10) {
    stopifnot(T > 0)
    stopifnot(beta >= 0 && beta <= 1)
    stopifnot(gamma >= 0 && gamma <= 1)
    total_pop <- S1 + I1 + R1
    df <- data.frame(t=seq_len(T))
    df[, c("S", "I", "R")] <- NA
    for(t in seq_len(T)) {
        if(t == 1) {
            df$S[t] <- S1
            df$I[t] <- I1
            df$R[t] <- R1
            next
        }
        DeltaS <- rbinom(n=1, size=df$S[t-1], prob=1 - (1-beta)^df$I[t-1])
        DeltaR <- rbinom(n=1, size=df$I[t-1], prob=gamma)
        df$S[t] <- df$S[t-1] - DeltaS
        df$I[t] <- df$I[t-1] + DeltaS - DeltaR
        df$R[t] <- df$R[t-1] + DeltaR
        stopifnot(df$S[t] + df$I[t] + df$R[t] == total_pop)  # Sanity check
    }
    return(df)
}

inverse_logit <- function(x) {
    p <- exp(x) / (1 + exp(x))  # Maps R to [0, 1]
    return(p)
}
curve(inverse_logit, -10, 10)  # Sanity check

loglik <- function(logit_beta_gamma, df) {
    stopifnot(length(logit_beta_gamma) == 2)
    beta <- inverse_logit(logit_beta_gamma[1])
    gamma <- inverse_logit(logit_beta_gamma[2])
    dS <- -diff(df$S)
    dR <- diff(df$R)
    n <- nrow(df)
    pr_dS <- 1 - (1-beta)^df$I[seq_len(n-1)]  # Careful, problematic if 1 or 0
    return(sum(dbinom(dS, size=df$S[seq_len(n-1)], prob=pr_dS, log=TRUE) +
               dbinom(dR, size=df$I[seq_len(n-1)], prob=gamma, log=TRUE)))
}

get_estimates <- function() {
    df <- simulate()
    mle <- optim(par=c(-4, 0), fn=loglik, control=list(fnscale=-1), df=df)
    beta_gamma_hat <- inverse_logit(mle$par)
    names(beta_gamma_hat) <- c("beta", "gamma")
    return(beta_gamma_hat)
}

set.seed(54321999)

df <- simulate()
df_melted <- melt(df, id.vars="t")
p <- (ggplot(df_melted, aes(x=t, y=value, color=variable)) +
      geom_line(size=1.1) + theme_bw() +
      xlab("time") +
      theme(legend.key=element_blank()) +
      theme(panel.border=element_blank()))
p

## Sampling distribution of beta_gamma_hat
estimates <- replicate(100, get_estimates())
df_estimates <- as.data.frame(t(estimates))
summary(df_estimates)  # Looks reasonable given true values of (0.005, 0.10)

Let me know if anything is not self-explanatory.

Disclaimer: I haven't studied the SIR model except once very briefly in a college class, several years ago. The model I simulate and estimate above is not exactly the classic differential equation SIR model you stated in your question. Also I'm feeling a bit feverish today, so check the code for mistakes!
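The comment in loglik warns that pr_dS is problematic when it hits 0 or 1. Below is a minimal sketch of one way to make the two probability computations more robust at the extremes, using base R only. The helper names pr_infection, pr_naive and inverse_logit_naive are hypothetical and not part of the code above. The sketch assumes 0 <= beta < 1.

```r
# Infection probability 1 - (1 - beta)^I, computed through log1p/expm1 so that
# a very small beta does not round to exactly 0. Assumes 0 <= beta < 1.
pr_infection <- function(beta, I) {
  -expm1(I * log1p(-beta))
}

# Direct formula, as used in loglik above
pr_naive <- function(beta, I) {
  1 - (1 - beta)^I
}

# For ordinary values the two versions agree
p_stable <- pr_infection(0.005, 10)
p_naive  <- pr_naive(0.005, 10)

# For a tiny beta the direct formula underflows to 0, which would make
# dbinom(..., log=TRUE) return -Inf whenever a new infection is observed
tiny_naive  <- pr_naive(1e-20, 1)
tiny_stable <- pr_infection(1e-20, 1)

# exp(x) / (1 + exp(x)) overflows for large x; plogis() is base R's
# inverse logit and does not
inverse_logit_naive <- function(x) exp(x) / (1 + exp(x))
big_naive  <- inverse_logit_naive(800)
big_plogis <- plogis(800)
```

Whether this matters in practice depends on where the optimizer wanders. With starting values like c(-4, 0) it may never reach these extremes.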

Solved – How to fit the SIR and SEIR models to the epidemiological data

I am going to confine my comments to the SEIR model - the issues for the SIR model are similar and it can be treated as a special limiting case of the SEIR model anyway (for large $\delta$).

What you've done so far

I've had a look at your MATLAB code, which seems absolutely fine to me. For a given set of model parameters, your code solves the SEIR differential equations to give functions $S(t)$,$E(t)$, $I(t)$, $R(t)$ on some time interval. You then calculate the cumulative state $J(t):=\int_0^t I(u) du$ which is used as a basis for fitting the model (correct me if I'm wrong here).

Available data: you have time series $C_{data}(t)$ and $M_{data}(t)$, which are the cumulative number of cases and deaths respectively. Model fitting proceeds by minimising the difference between the curves $J(t)$ and $C_{data}(t)-M_{data}(t)$. (This assumes a disease case corresponds to an individual transitioning to the $I$ state.) A poor fit is obtained. It's also questionable how meaningful the confidence intervals are - the lower limit is often negative even though all model parameters are constrained to be positive.

Model vs data

I can see several issues with the way that the specified SEIR model relates to the available data. Firstly $J(t)$ above does not represent the number of infectious individuals, which is simply $I(t)$. It seems that you actually want to be equating $I(t)$ to $C_{data}(t)-M_{data}(t)$. Computing $J(t)$ seems unnecessary.

Second, it appears that you're implicitly assuming that 'recovery' (transition to the $R$ category) always leads to death. However, I understand that - in the case of Ebola - it is also possible to be 'cured'. So, the available death data can't be directly related to the variables in the SEIR model you set up. This points to the need for a model that will take account of the different recovery modes that are possible with Ebola.

A third issue is that, by subtracting one data time series from the other, you're losing some of the information in the original data. Ideally it would be good to fit the model using both of the available time series.

Modified SEIR model and fitting procedure

To improve model fitting I would suggest looking at the modelling done in this paper. Here they use a modified SEIR model for Ebola, which looks something like

\begin{align}
{\mathrm d S \over \mathrm d t} &= -\beta {S I \over N}\\[1.5ex]
{\mathrm d E \over \mathrm d t} &= \beta {S I \over N} - \delta E \\[1.5ex]
{\mathrm d I \over \mathrm d t} &= \delta E - \gamma I \\[1.5ex]
{\mathrm d R \over \mathrm d t} &= (1-f)\gamma I
\end{align}

Here $f$ is the case fatality rate, so the $R$ state corresponds to 'cured'. In the context of this model, the cumulative number of cases is $C(t)=\int_0^t \delta E(u)du$ and the cumulative number of deaths is $M(t)=\int_0^t f\gamma I(u)du$. Perhaps it would be possible to fit these two curves simultaneously in MATLAB?
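As a rough illustration of fitting both curves at once, here is a sketch in R with deSolve (rather than MATLAB). It appends $C$ and $M$ to the state vector so the ODE solver produces them directly, then sums the squared errors of both series. All parameter values, the population size, and the synthetic "data" are hypothetical. The function names seir_fatal and joint_sse are illustrative only.

```r
library(deSolve)

# Modified SEIR model with two extra states:
# C = cumulative cases, M = cumulative deaths
seir_fatal <- function(time, state, parms) {
  with(as.list(c(state, parms)), {
    dS <- -beta * S * I / N
    dE <-  beta * S * I / N - delta * E
    dI <-  delta * E - gamma * I
    dR <-  (1 - f) * gamma * I
    dC <-  delta * E      # C(t) = integral of delta * E
    dM <-  f * gamma * I  # M(t) = integral of f * gamma * I
    list(c(dS, dE, dI, dR, dC, dM))
  })
}

# Joint squared-error loss over both observed series
joint_sse <- function(theta, times, C_data, M_data, init, N) {
  parms <- c(beta = theta[[1]], delta = theta[[2]],
             gamma = theta[[3]], f = theta[[4]], N = N)
  out <- ode(y = init, times = times, func = seir_fatal, parms = parms)
  sum((C_data - out[, "C"])^2) + sum((M_data - out[, "M"])^2)
}

# Hypothetical setup, used only to exercise the functions
N     <- 1e4
times <- 0:100
init  <- c(S = N - 1, E = 0, I = 1, R = 0, C = 0, M = 0)
theta_true <- c(0.3, 0.1, 0.15, 0.5)  # beta, delta, gamma, f

# Synthetic "data" generated from the model itself (no noise)
out_true <- ode(y = init, times = times, func = seir_fatal,
                parms = c(beta = 0.3, delta = 0.1, gamma = 0.15, f = 0.5, N = N))
C_data <- out_true[, "C"]
M_data <- out_true[, "M"]

sse_true <- joint_sse(theta_true, times, C_data, M_data, init, N)
sse_off  <- joint_sse(c(0.4, 0.1, 0.15, 0.5), times, C_data, M_data, init, N)
c(at_truth = sse_true, perturbed_beta = sse_off)
```

A function like joint_sse could then be handed to optim with box constraints, as in the SIR example. If the two series differ a lot in scale, weighting the two sums may be worth considering.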

Other models

More complex models are of course possible e.g. see this paper where additional disease categories are used. We could also add stochasticity, more detailed contact structure models, etc. Fitting transmission models to the 2014 Ebola outbreak data is an active area of research. Still, you might hope to get a reasonable fit using the modified SEIR model above. What I'm trying to say is that fitting transmission models to the Ebola outbreak data is not a trivial task!

Finally: the paper you refer to does not appear to be a peer reviewed journal article. It's also anonymous. I wouldn't rely on it as an information source.
