Solved – How to fit the SIR and SEIR models to the epidemiological data

curve fittingepidemiologyfittingMATLAB

I have the epidemiological data [xlsx] [csv] of the 2014 outbreak of the Ebola virus in Sierra Leone.

I wanted to model the outbreak with both the SIR compartmental model

$$
\begin{align}
{\mathrm d S \over \mathrm d t} &= -\beta {S I \over N}\\[1.5ex]
{\mathrm d I \over \mathrm d t} &= \beta {S I \over N} – \gamma I \\[1.5ex]
{\mathrm d R \over \mathrm d t} &= \gamma I \\
\end{align}
$$

and the SEIR compartmental model

$$
\begin{align}
{\mathrm d S \over \mathrm d t} &= -\beta {S I \over N}\\[1.5ex]
{\mathrm d E \over \mathrm d t} &= \beta {S I \over N} – \delta E \\[1.5ex]
{\mathrm d I \over \mathrm d t} &= \delta E – \gamma I \\[1.5ex]
{\mathrm d R \over \mathrm d t} &= \gamma I \\
\end{align}
$$

pretty much like it was (claimed to be) done on page 7 of the printed version of this paper.

Since the S(E)IR models don't admit a close analytical solution I modelled them in Matlab with a couple of .m files for each model.

The SIR and SEIR functions return the cumulative state (what I interpreted as the integral of each state variable) of the system since I only have the cumulative data from the epidemiologic bulletins.


SIR

SIR_factory.m

% Returns a function that returns the evaluation of the ODE defining the
% system at time t.
%
% Using a High-order function saves me from learning how to pass
% parameters to the function used by the ODE solvers :P
function ode = SIR_factory(beta, gamma, N)
    %Returns an array with [dS/dt dI/dt dR/dt]
    ode = @(t,y) [
                %y(1) = S, y(2) = I, y(3) = R
                -beta*y(1)*y(2)/N
                beta*y(1)*y(2)/N-gamma*y(2)
                gamma*y(2)
            ];

SIR.m

% For some reason the function must handle the input x in a matrix form or
% the fitting app invocation of this function will throw a matrix concat
% error
%
%INPUT:
%   beta and gamma are the system parameters
%   S0 and I0 are the initial conditions
%   x is the time at which the state of the system is requested
%   i is the index of the state of the system to return (1 = S, 2 = I, 3 =
%   R)
function R = SIR(beta, gamma, S0, I0, x, i)
    %Total population is S0 (Susceptible individual) plus I0 (Infected
    %ones), this means R0 is implicitly 0
    N = S0 + I0;

    %Find the max of the matrix
    M = max(max(x));

    %Pre-allocate the output
    R = zeros(size(x, 1), size(x, 2));

    %Solve the system, from time 0 to the max time requested, with the
    %initial conditions given
    [t,y] = ode45(SIR_factory(beta, gamma, N), [0 M], [S0, I0, 0]);

    %Comput the "integral" of the state variable requested
    y = cumtrapz(t, y(:,i));

    %Fill the output (Ugly)
    for r=1:size(R,1)
        for c=1:size(R,2)
           R(r,c) = interp1q(t, y, x(r, c));
        end
    end

SEIR

SEIR_factory.m

% Returns a function that returns the evaluation of the ODE defining the
% system at time t.
%
% Using a High-order function saves me from learning how to pass
% parameters to the function used by the ODE solvers :P
function ode = SEIR_factory(beta, gamma, delta, N)
    %Returns an array with [dS/dt dE/dt dI/dt dR/dt]
    ode = @(t,y) [
                %y(1) = S, y(2) = E, y(3) = I, y(4) = R
                -beta*y(1)*y(3)/N
                beta*y(1)*y(3)/N-delta*y(2)
                delta*y(2)-gamma*y(3)
                gamma*y(3)
            ];

SEIR.m

% For some reason the function must handle the input x in a matrix form or
% the fitting app invocation of this function will throw a matrix concat
% error
%
%INPUT:
%   beta, gamma and delta are the system parameters
%   S0 and E0 are the initial conditions
%   x is the time at which the the state of the system is requested
%   i is the index of the state of the system to return (1 = S, 2 = E, 3 =
%   I, 4 = R)
function R = SEIR(beta, gamma, delta, S0, E0, x, i)
    % The total population is the number of Susceptible individualts plus
    % the Exposed one. This implicitly sets I0 = R0 = 0
    N = S0 + E0;

    %Max of the input matrix
    M = max(max(x));

    %Pre-allocate the output
    R = zeros(size(x, 1), size(x, 2));

    %Solve the system from time 0 to time M with the initial given
    %conditions
    [t,y] = ode45(SEIR_factory(beta, gamma, delta, N), [0 M], [S0, E0, 0, 0]);

    %Compute the "integral" of the required state
    y = cumtrapz(t, y(:,i));

    %Fill the output matrix
    for r=1:size(R,1)
        for c=1:size(R,2)
           R(r,c) = interp1q(t,y,x(r, c));
        end
    end

The intent is to fit these models the data of the outbreak.

To do so, I converted the report dates into days since the first report and for each row computed the number of infected individuals as the Total cases minus the Total Deaths.
The resulting CSV is here.
The original data, as an .xlsx file, are here.

To get an idea of the data, here is an excerpt

...
121,235
125,301
128,300
131,322
132,373
136,419
140,415
141,449
143,462
147,483
149,533
150,518
156,604
165,770
...

I read this data into Matlab, after adding the appropriate path, as

>>>DATA = csvread('ebola-sierra-leone-inf.csv')
>>>DAYS = DATA(:,1)
>>>INF = DATA(:,2)

Then I used the Curve fitting app to fit the models.
The fitting session .sfit file is here.

In the year 2014 the Sierra Leone population was about 7.079 millions which I use as the initial $S_0$ subtracted an unknown quantity representing the $I_0$ (SIR) or $E_0$ (SEIR) variable.

SIR

I create a SEIR fitting, using DAYS as X data and INF as Y data.

I chose a custom equation of expression SIR(b, c, 7079000 - I, I, x, 2) that represents a SEIR model with parameters b ($\beta$) and c ($\gamma$) (all constrained in [0, 1] and with initial values of 0.5) and returns the cumulative number of infected individual over time.

The fitting parameter I is constrained in [0, $\infty$) and has an initial value of 1.

SEIR

I create a SEIR fitting, using DAYS as X data and INF as Y data.

I chose a custom equation of expression SEIR(b, c, d, 7079000 - E, E, x, 3) that represents a SEIR model with parameters b ($\beta$), c ($\gamma$) and d ($\delta$) (all constrained in [0, 1] and with initial values of 0.5) and return the cumulative number of infected individual over time.

The fitting parameter E is constrained in [0, $\infty$) and has an initial value of 1.


Results

Matlab finds a very poor fit.

SIR

General model:
     f(x) = SIR(b, c, 7079000 - I, I, x, 2)
Coefficients (with 95% confidence bounds):
       I =          21  (16.63, 25.38)
       b =      0.9999  (-0.1879, 2.188)
       c =           1  (-0.1859, 2.186)

Goodness of fit:
  SSE: 3.124e+08
  R-square: 0.9153
  Adjusted R-square: 0.9147
  RMSE: 1092

Fitted SIR

SEIR

General model:
     f(x) = SEIR(b, c, d, 7079000 - E, E, x, 3)
Coefficients (with 95% confidence bounds):
       E =       43.96  (-20.21, 108.1)
       b =      0.9989  (-2.32, 4.318)
       c =           1  (-2.317, 4.317)
       d =           1  (fixed at bound)

Goodness of fit:
  SSE: 3.252e+08
  R-square: 0.9119
  Adjusted R-square: 0.9112
  RMSE: 1114

Fitted SEIR

However…

… if I change the population count from about 7.079 million to 70.790 (1/100 of the original) both fitting improve

Fitted SEIR with 1/100 of the population

Obviously, I'm doing something wrong conceptually and/or numerically.

How can I fit a S(E)IR model to the epidemiological data?

Best Answer

I am going to confine my comments to the SEIR model - the issues for the SIR model are similar and it can be treated as a special limiting case of the SEIR model anyway (for large $\delta$).

What you've done so far

I've had a look at your MATLAB code, which seems absolutely fine to me. For a given set of model parameters, your code solves the SEIR differential equations to give functions $S(t)$,$E(t)$, $I(t)$, $R(t)$ on some time interval. You then calculate the cumulative state $J(t):=\int_0^t I(u) du$ which is used as a basis for fitting the model (correct me if I'm wrong here).

Available data: you have time series $C_{data}(t)$ and $M_{data}(t)$, which are the cumulative number of cases and deaths respectively. Model fitting proceeds by minimising the difference between the curves $J(t)$ and $C_{data}(t)-M_{data}(t)$. (This assumes a disease case corresponds to an individual transitioning to the $I$ state.) A poor fit is obtained. It's also questionable how meaningful the confidence intervals are - the lower limit is often negative even though all model parameters are constrained to be positive.

Model vs data

I can see several issues with the way that the specified SEIR model relates to the available data. Firstly $J(t)$ above does not represent the number of infectious individuals, which is simply $I(t)$. It seems that you actually want to be equating $I(t)$ to $C_{data}(t)-M_{data}(t)$. Computing $J(t)$ seems unnecessary.

Second, it appears that you're implicitly assuming that 'recovery' (transition to the $R$ category) always leads to death. However, I understand that - in the case of Ebola - it is also possible to be 'cured'. So, the available death data can't be directly related to the variables in the SEIR model you set up. This points to the need for a model that will take account of the different recovery modes that are possible with Ebola.

A third issue is that, by subtracting one data time series from the other, you're losing some of the information in the original data. Ideally it would be good to fit the model using both of the available time series.

Modified SEIR model and fitting procedure

To improve model fitting I would suggest looking at the modelling done in this paper. Here they use a modified SEIR model for Ebola, which looks something like \begin{align} {\mathrm d S \over \mathrm d t} &= -\beta {S I \over N}\\[1.5ex] {\mathrm d E \over \mathrm d t} &= \beta {S I \over N} - \delta E \\[1.5ex] {\mathrm d I \over \mathrm d t} &= \delta E - \gamma I \\[1.5ex] {\mathrm d R \over \mathrm d t} &= (1-f)\gamma I \\ \end{align} Here $f$ is the case fatality rate, so the $R$ state corresponds to 'cured'. In the context of this model, the cumulative number of cases is $C(t)=\int_0^t \delta E(u)du$ and the cumulative number of deaths is $M(t)=\int_0^t f\gamma I(u)du$. Perhaps it would be possible to fit these two curves simultaneously in MATLAB?

Other models

More complex models are of course possible e.g. see this paper where additional disease categories are used. We could also add stochasticity, more detailed contact structure models, etc. Fitting transmission models to the 2014 Ebola outbreak data is an active area of research. Still, you might hope to get a reasonable fit using the modified SEIR model above. What I'm trying to say is that fitting transmission models to the Ebola outbreak data is not a trivial task!

Finally: the paper you refer to does not appear to be a peer reviewed journal article. It's also anonymous. I wouldn't rely on it as an information source.