SIR Model for COVID-19: Estimating R0R_0

Roberto Berwa, MIT []

At the beginning of pandemic with a limited amount of data and knowledge of the transmission mechanisms behind COVID-19, various models from the scientific community, mathematicians, and epidemiologists, were useful in gauging a general view of how the pandemic would develop.

In this article, we discuss the most basic mathematical model of an infectious disease: SIR model, and we also use the model to estimate the basic reproductive number. We estimate the R0R_0 to be 4.90 (95% CI 4.67294, 5.13887) in the US, which agrees with the most recent study based on early data from Wuhan, China whose estimates sets R0R_0 median value at 5.7 (95% CI 3.8–8.9) (Sanche, Lin, Xu, Romero-Severson, Hengartner and Ke 2020).

Here is what is covered in this article.

  • SIR model description

  • Solve SIR ODE's

  • Estimating R0R_0

  • General use of this model

Disclaimer: The author of this article has no considerable expertise in the subject of the matter discussed. Opinions and results shown in this document should not be referenced or used while taking any decisions unless advised by an expert or a specialist.

Model Description: SIR MODEL

In mathematical modeling for infectious diseases, models can be of two types: stochastic (probabilistic) and deterministic. Stochastic models tend to be more complicated to analyze than deterministic models, but they also tend to be very useful as well. I will discuss a deterministic SIR model using ordinary differential equations.

Compartments of SIR model

The SIR model, originating from the 1700s, classifies a fixed population into three compartments at a given time: S(t) (susceptible group), I(t) (infected group), and R(t) (Recovered) (Kermack and  McKendrick  1927).

  • S(t): From a fixed population, this group counts the number of people who are susceptible to being infected at the time t.

  • I(t): The number of people who are currently infectious at time t. (i.e. people who can infect other people).

  • R(t): people who are no longer infectious at time t, either because they have been cured or unfortunately because they died.

In this model, the population moves from being susceptible to infected and from infected to recovery. S(t)I(t)R(t)S(t) \rightarrow I(t) \rightarrow R(t)

From S to I compartment, effective contact between an infected person and an infectious person must take place. Effective contact is defined as an interaction between a susceptible person and an infectious person that results in the susceptible person getting infected. The effective contact rate per infectious person, σ\sigma, is defined as the average number of people that an infected patient can infect per unit time. This can be computed from the transmission risk, p, the probability of infecting a susceptible person with whom an infected patient is in contact, and the number of total contacts with susceptible people, τ\tau.

η\eta, the average number of people that a person comes in contact with per unit time.

Let's define β\beta, the number of people infected by an infectious person given that all people in his/her contact are susceptible.

(i.e. we replace (eq.4) into (eq.3) to find (eq.5))

From I to R compartment, let γ\gamma be the probability that a person may recover at a time t.

Mathematical description of the model

Here is the mathematical expression of the model in the ODE(ordinary differential equation) form.

Dependent variable

For this article, we focus on the ratio of β\beta to γ\gamma which is known as the basic reproductive number, R0R_0, defined as the expected number of cases directly generated by one case in a population where all individuals are susceptible to infection. This ratio is useful to determine if the virus will become an epidemic or not. When R0R_0 > 1, the infectious disease evolves into an epidemic.


As already implied in the equations, there are various assumptions taken into account. First, the probability of contracting the disease or recovering from it is the same for everyone. Second, we have a fixed population. In the case of a pandemic that would mean that there are no births or deaths and that the net international migrant is zero. (The veracity of these assumptions are revisited in section 4.)

Solving SIR ODE model in Julia

The differential equations describing the model can be defined in various methods, the most basic being the Euler method. However, in this case, we will solve the ODE system using DifferentialEquations.jl , a well-optimized package in Julia built in consultation with the most recent techniques in the research of algebraic computation. (Julia benchmarks better than most single-purpose languages. )

To reduce the number of variables in the model and to avoid estimating the population, we set the ratio β\beta to N to just β\beta.

using Pkg;
using Plots
using DifferentialEquations;
# define the parameters:
function sir_ode!(δu, u, p,  t)
    # unpack variables and parameters:
    S, I, R = u
    β, γ = p   
    # define differential equations:
    δS = -β*S*I
    δI = +β*S*I - γ*I
    δR = +γ*I
    δu .= (δS, δI, δR)   # copy the values into the vector du; note the `.`
# define the paramaters:
β = 0.1
γ = 0.05
parameters = [β, γ]
# define the initital values:
S₀ = 0.99 
I₀ = 0.01
R₀ = 0.0
initial_values = [S₀, I₀, R₀]
time_span = [0.0, 200.0]  # initial and final time
# set up problem:
problem = ODEProblem(sir_ode!, initial_values, time_span, parameters)
solution = solve(problem, saveat = 0.1);
plot(solution, label=["S" "I" "R"], Title = "SIR Model Evolution", ylabel = "% of the population", xlabel = "days")

Given that βN=0.1&γ=0.05\frac{\beta}{N} = 0.1 \And \gamma = 0.05, from our experiment when we start with the scenario with 0.01 people are infectious, the infection will pick at 18% of the population. This case does not reflect COVID-19. (The interpretation is visited in Section 4).

Fitting the model

Models can be useful in estimating some key characteristics of an epidemic. It can estimate the recovery rate (γ\gamma) and the number of people infected by an infectious person given that all people in his/her contact are susceptible (β\beta). Here, we only estimate R0R_0.


We use the analytical solution to our ODEs and estimate the dependent variable using a non-linear least squares concept.

For non-linear least squares, the most common method is the Levenberg-Marquardt algorithm, (LMA ), also known as the Damped least-squares (DLS) method. It interpolates between the Gauss-Newton algorithm and gradient descent for efficiency (Gavin 2019). This article uses its implementation in Julia, but it is very popular and available in most programming languages.

Data Manipulation

We use the data from the Center for Systems Science and Engineering, John Hopkins' Whiting School of Engineering. We will focus on one country, the US. This data is not the most accurate, but the errors in the data are mainly due to external factors, such as lack of testing for those exposed.

As part of the model's assumptions, we fix the population of the US at 329,686,270, as of May 2020. Note that the US Census estimates about one birth every 8 seconds and one death every 12 seconds. However, we ignore these rates. More complicated, and possibly accurate, models take them into account.

using DataFrames
using Dates
using Plots
using CSV
# donwload the data:
url_confirmed_cases= ""
url_death = ""
url_recovered = ""
download(url_confirmed_cases, "confirmed_cases_global.csv");
download(url_death, "death_global.csv");
download(url_recovered, "recovered_global.csv");
# read, clean and view the data:
data_confirmed_cases ="confirmed_cases_global.csv")
data_death ="death_global.csv")
data_recovered ="recovered_global.csv")
CSV.rename!(data_recovered, 1 => "province", 2 => "country")
CSV.rename!(data_death, 1 => "state", 2 => "country")
CSV.rename!(data_confirmed_cases, 1 => "state", 2 => "country")
data_us_confirmed= data_confirmed_cases[ .== "US",:]
data_us_death = data_death[ .== "US", :] 
data_us_recovered = data_recovered[ .== "US", :]
date_strings = String.(names(data_confirmed_cases))[5:end];
format = Dates.DateFormat("m/d/Y");
dates = parse.(Date, date_strings, format) .+ Year(2000);
# plot the data on graph: 
p = plot(title="US COVID-19", ylabel = "US population", xlabel = "date")
us_confirmed_vec = vec(Array(data_us_confirmed[5:end]))
us_recovered_vec = vec(Array(data_us_recovered[5:end]))
us_death_vec = vec(Array(data_us_death[5:end]))
us_tot_recovered_vec = vec(Array(data_us_recovered[5:end])) + vec(Array(data_us_death[5:end]))
us_total_agents = us_tot_recovered_vec + us_confirmed_vec
#building susceptible:
us_population = 329686270
us_susceptible_vec = us_population .- us_total_agents
plot!(dates, us_recovered_vec, label = "Recovered")
plot!(dates, us_confirmed_vec, label = "Infected")
plot!(dates, us_tot_recovered_vec, label = "Infected + Death")
plot!(dates, us_death_vec, label ="Death")

From the graph, the total number of cases in the number of confirmed cases grows exponentially, which implies that our model captures some general description of the pandemic, as the solution to the model results in an exponential form. (i.e. This can be checked by plotting the log of the plot).

( In the dataset, Infected refers to the total number cases recorded, not active cases. And in the graph, Infected + Death is the recovered compartment discussed in section 1. )

Estimating the basic reproductive number

To solve for R0R_0 , we use the exact analytical solution of the evolution of the SIR model. (Harko, Lobo, and Mak 2014)

We solve for I(t).

We solve the model for parameters that minimizes the least-squares using LMA. Because it does not guarantee to find the global minimum, we pick a few random initial parameters to increase the likelihood of falling into a global minimum.

#using LsqFit;
# Set up (eq.8) with N = us_population: 
model(R, param) = 329686270 .- R - 329686270 .* exp.(-(param[1]/329686270).*R)
# Apply LMA:
params = []
active_infections_vec = us_confirmed_vec - us_recovered_vec
for i in 1:10
  pₒ = [randn()]
	global fit = curve_fit(model, us_recovered_vec, active_infections_vec, pₒ)
  push!(params, fit.param[1])
@show params'
#compute confidence interval at 0.05 confidence level:
@show confidence_intervals = confidence_interval(fit, 0.05)

Using the analytical method, we cannot find the actual values of β and γ, but we can find the R0R_0, which in this case of COVID-19 in the U.S is estimated to be 4.90 (95% CI 4.67294, 5.13887), statistically significant. Early estimations of the reproductive number, using a hybrid deterministic–stochastic SEIR (susceptible-exposed-infectious-recovered) model, sets its median value at 5.7 (95% CI 3.8–8.9) (Sanche, Lin, Xu, Romero-Severson, Hengartner and Ke 2020).

Visualize the fitted model

Due to its limitations, a deterministic SIR model weakly captures the details of the evolution of the pandemic. It should not be taken as an exact predicting tool for COVID-19, at least from our judgment (refer to section 4.). However, it still captures some general aspects of the model. For instance, it still captures the exponential growth characteristic of COVID-19 as seen in the plot below.

fitted_I = model(us_recovered_vec, [params[1]])
p = plot(title = "Active cases in the US", ylabel = "Active cases", xlabel = "date")
plot!(dates, active_infections_vec, label = "active I")
plot!(dates, fitted_I, linestyle = :dot, label = "fitted I")

General thoughts on using this model

Most about scientific modeling can be summarised with an aphorism commonly used in the statistics community, "All models are wrong, but some of them are useful." By informal definition, modeling is a process of creating a simple representation of complex systems that captures the general underlying truth about the system. The more complicated, the better the assumptions are, and the more the results are accurate. However, taking solid assumptions makes it hard, sometimes even impossible, to analyze. SIR model is built on assumptions (look at section 1) that are not a reasonable reflection of the real world. Beyond the assumptions, the model does not take into account changes in policies and measures put into effect to reduce the spread of COVID-19 and will not self-correct when given incomplete data(as is the case due to poor testing ability in the early days of the pandemic). This excludes the use of such simplistic models to compute or predict specific parameters such as death rate, infection rate, and the number of infectious people in 2 weeks. However, such a model can still be useful in skeptically estimating parameters such as the basic reproductive number.

Some of the issues mentioned of SIR are addressed using more complex models (stochastic derivations of SIR model, spatial models, or Bayesian tree model) and performing different tests(anti-body testing) to gauge the correct number of infections.


To check out the source code, click on the upward-pointing arrow of "show source code" if the source code is not visible.


  1. Harko, Tiberiu, Lobo, Francisco S.N., and Mak, M.K. (2014)“Exact Analytical Solutions of the Susceptible-Infected-Recovered (SIR) Epidemic Model and of the SIR Model with Equal Death and Birth Rates.” Applied Mathematics and Computation 236: 184–194. Crossref. Web.

  2. Gavin, Henri P.(2013) “The Levenberg-Marquardt method for nonlinear least-squares curve-fitting problems”.

  3. Kermack, W. O., McKendrick, A. G (1927). "A Contribution to the Mathematical Theory of Epidemics". Proceedings of the Royal Society A115 (772): 700–721. 

  4. Sanche S, Lin YT, Xu C, Romero-Severson E, Hengartner N, Ke R.(2020) High contagiousness and rapid spread of severe acute respiratory syndrome coronavirus 2. Emerg Infect Dis.

Runtimes (1)