Getting a Clearer Picture of COVID-19 in Rio de Janeiro

Forecasting infections, mortalities and the impact of relaxing social distancing measures using a Seasonal Kinetic SIRD model

Published in

Analytics Vidhya

15 min readMay 27, 2020

What a year has 2020 has been so far right? We’re stuck at home, we can’t meet friends or family, most of our routines have been thrown out the window and every time we turn on the TV, check Twitter or any news outlet we get the latest news on the tragic COVID-19 pandemic. With my college classes moving to an online format, I decided to move from New York back to my home city of Rio de Janeiro, Brazil, until things get back to normal with hopes that staying with family would make quarantine much better and that social distancing wouldn’t be as tough back home.

I was right about staying with family, I could not imagine how it would be like staying at an empty college dorm throughout the later spring and summer, but boy was I wrong about the social distancing part. “Oh, we have close to no cases”, I thought, “we’ll be fine!”, I thought. As of the time of writing this article, Brazil is the #2 country in the world with the most confirmed COVID-19 cases and the US just issued a travel ban from Brazil over coronavirus concerns, just great right? Needless to say, social distancing measures have become a huge topic of conversation here much like in the US, with the city government in Rio de Janeiro recently suggesting that easing social distancing measures might be on the horizon. Like any Math major, I couldn’t help myself and decided to take a look at the data myself in an attempt to get a better understanding of what’s going on in Rio. I’ll try to keep a casual tone throughout this article in order to make it appealable to a wider audience, however, I’ll inevitably need to use some jargon here and there.

Before anything, I must state that I am no government organization, so you should not draw any definite conclusions about the state of coronavirus in Rio simply from my results. I am just an individual performing an analysis in his spare time, so I could very well be completely wrong. I might even have a bug in my code for all you know, but I guess you’ll just have to trust me (the GitHub repository with all the code used in this article can be found here).

Analysis Goals and Getting The Data

My objective was to find a model that fits the data and from which I could draw some interesting information about the pandemic’s behavior at the city level, such as estimates about the number of infections and mortalities, the impact of social distancing measures and R₀ estimates. Notably, the latter is the virus’ basic reproductive number, the number of people someone with COVID-19 infects on average until they recover, a very talked-about parameter in the medical community and if you’d like to learn more about it I’d highly recommend [1] by Jones.

The coronavirus data used for this article consisted of the number of confirmed COVID-19 cases and deaths in the state of Rio de Janeiro compiled by the Rio de Janeiro Healthcare Secretariat and accessed on May 26th, 2020, which can be accessed here and was reduced to cases in the city of Rio de Janeiro only, as that is our region of interest. The secretariat states that if a person confirmed to have COVID-19 hadn’t died or been hospitalized within 14 days of their initial symptoms they were considered to be recovered, thus that’s how the number of recovered individuals was computed.

Additionally, the Rio de Janeiro population estimate was acquired from the Brazilian Institute of Geography and Statistics [2], which was used to compute the number of people susceptible to the coronavirus assuming recovered individuals become immune to the virus.

The Model, Assumptions and Optimization

Model Definition

The model used here is a mixture of the ones employed by Danon et al. [3] and Caccavo [4] in their respectively brilliant papers, which I (rather uncreatively) dubbed as a Seasonal Kinetic SIRD model, which I’ll describe below.

Let S(t), I(t), R(t), D(t) respectively denote the number of individuals susceptible to COVID-19, number of currently infected individuals, number of recovered individuals and number of individuals who died from the disease at time t. Define N(t) = S(t)+I(t)+R(t)-D(t) to be the total city population at t. A SIRD model was used as a basis to capture these variables, in which a person in a susceptible state once infected transitions to the infected state, where they may infect other yet susceptible individuals, and then finally transitions into either the recovered or the deceased states. This can be visualized in the diagram below

Diagram showing possible progression path of an individual who becomes infected in the SIRD model.

Under the assumptions that will be listed afterwards, these quantities follow the system of ordinary differential equations

where β(t) is the rate of infection, γ(t) is the rate of recovery and μ(t) is the rate of mortality, all defined as the inverse of the average amount of time a person takes to transition into their respective states from a previous state at a time t. For example, β(t) is the inverse of the average amount of time it’d take a person to go from susceptible to infected at time t. The traditional SIRD model assumes these rates to be constant, but our seasonal kinetic prefix comes from the fact we let these rates take the time-dependent forms

Rate definitions for Seasonal Kinetic SIRD model

Let’s take some time just to know what each of these parameters mean, for which we need to know the rationale behind them. Here, the infection rate increases seasonally, reaching a seasonal peak in the Winter as goes the hypothesis that COVID-19 is more infectious in colder, drier weather, but also decreases exponentially over time as a result of social distancing measures⁴. The recovery rate increases logistically over time as to reflect healthcare systems progressively adapting to the pandemic⁴. Finally, the mortality rate decreases exponentially over time to a flat rate, meant to represent the adaptation of healthcare infrastructure as well as how less lethal strains of the virus may become comparatively more widespread as the pandemic goes on⁴. We then are able to intuitively define our parameters as:

β₀ : Initial base infection rate
τᵦ : Temporal lag term for how fast the infection rate decreases
m : Magnitude of seasonal increase in infection rate
γ₀ : Initial recovery rate
γ₀ + γ₁ : Final recovery rate
τᵧ : Temporal lag term for how fast the recovery rate increases
μ₀ + μ₁ : Initial mortality rate
μ₀ : Final mortality rate
τ_μ : Temporal lag term for how fast the mortality rate decreases (sorry about the poor formatting, blame Unicode I guess)

Assumptions

Finally, the following are the assumptions made by the Seasonal Kinetic SIRD model:

The total population remains constant and does not interact with people outside the modeled population.
Individuals may only become infected when interacting with infected individuals.
Individuals are homogeneously mixed and interact at random.
Time taken by an individual to transition between states is exponentially distributed.
Recovered individuals become immune to the disease and cannot be infected again.
Disease impacts all demographics equally.
Infection rates decrease over time, recovery rates increase over time and mortality rates decrease over time.

Some of these assumptions are of course not completely true at the micro level, however, at the macro level being dealt with the difference between our model and a more complex one becomes statistically insignificant. It must be noted that because of its reliance on these assumptions, this model is not appropriate for modeling very small communities.

Optimization Procedure

From what I’ve seen in the literature about COVID-19 modeling, there are 2 general approaches to using models like this: either determining the parameters from specific characteristics and performing Monte Carlo simulations, or fitting the ODE system to the data and computing the solution. I chose to go with the latter since it allows our parameters to be more specific to Rio de Janeiro, where things like infection rate and temporal lags might be very different from places like Italy or the US.

This is where things get a bit technical, so feel free to skip to the results if these details are not of particular interest to you. As with any optimization scheme, we need a loss function to determine how well our model fits the data. Borrowing once again the ideas of Caccavo [4], due to the unavoidable noisiness of pandemic data the weighed bisquare loss function was chosen, as it allows for smoothing out the effects of outliers. Let θ be the set of model parameters, tᵢ ∈ ℤ₊ be the ith data point time, and xᵢ , f(tᵢ | θ) ∈ ℝ₊⁴ respectively be the ith data point and the model prediction at time tᵢ given parameters θ. Note that using Einstein notation, xᵢ¹ = S(tᵢ), xᵢ² = I(tᵢ), xᵢ³ = R(tᵢ) and xᵢ⁴ = D(tᵢ). The weighed bisquare loss function at the ith data point is then defined as

Weighed bisquare loss function definition

where k = 4.365 ⋅ MAD(εʲ), with MAD(εʲ) being the mean absolute deviation of the residuals of the jth variable, and wⱼ is a weight assigned to each variable meant to counterbalance the effects of different variables on the loss function due to their different scales, being defined as wⱼ = max { max(|εⁱ|) : 1 ≤ i ≤ 4} / max(|εʲ|). Note that we use the residual of the variable finite differences, which helps optimize the model by reducing correlation, and only the infected, residual and dead populations, since then the susceptible population follows.

When seeking to minimize the loss function, we could in theory use any local optimization procedure such as Quasi-Newton or Simplex methods as the bisquare function is concave and thus any local minimum must also be a global minimum. However, when attempting this I ran into several problems dealing with variation in the optimum parameters depending on the initial guess, which I suspect might be caused by the function being non-differentiable at |εʲ| = k and the zero derivatives when |εʲ| > k. As such, three different global optimization methods were applied: Differential Evolution (DE), Basin-Hopping (BH) and Dual Annealing (DA). If you’d like to learn about the specifics of these methods, I’d recommend [5], [6] and [7] respectively.

Results and Forecasting

Optimal Parameter Results

Using the excellent lmfit Python library, our Seasonal Kinetic SIRD model was fit to all the available data and the final parameters and models’ BIC, a measurement of goodness of fit that decreases with better fitting models, are shown in the table below.

Table comparing fitted model parameters and BICs for different optimization routines

Although Differential Evolution displays the best BIC value out of all fits, they are close enough to each other that I find it still interesting to consider the results from the other fitted parameters too. The plots of the model predictions compared to the empirical data are shown below.

Model predictions using different fitted parameters compared to data

Forecasting of Populations, Rates and R₀

With our freshly fit parameters, we are able to forecast the number of susceptible, infected, recovered and deceased people for any time in the future, but here I chose to constrain the predictions to only until the end of 2020. The forecasts about the number of active infected individuals, cumulative number of infected individuals and mortalities, which I consider to be the main variables of interest, are shown below.

Model predictions using different fitted parameters along with estimate of total number of infections and mortalities until the end of 2020

We see that the Basin-Hopping parameters seem to be the most optimistic when considering the total number of mortalities, while the Dual Annealing parameters are more so in the number of total infections. Either way, it is clear that the Differential Evolution parameters, which achieved the best fit, predict the highest number of infections and mortalities.

Next, the infection, recovery, mortality rates and R₀, defined as R₀ ≡ β(t) / (γ(t) + μ(t)), are plotted below.

Estimates of infection, recovery and mortality rates and R₀ using fitted parameters

This is where we begin to see some qualitative difference between the different sets of model parameters. All parameter sets predict a sharp increase in the recovery rate in the month of April and a progressive decrease in infection rates over time, although infection rate predicted by the Differential Evolution parameters begins at a higher value and displays a much sharper decrease. It must also be noted that all fitted models suggest a marginal influence of seasonality on the infection rate, which is not very surprising given that the temperature variation in tropical regions like Rio are much less pronounced than somewhere like Europe or the Northeastern US. Furthermore, when looking at the mortality rates, both the Basin Hopping and Dual Annealing parameters suggest an approximately constant mortality rate, while Differential Evolution results in a much larger initial rate which exponentially decreases to comparable levels around April.

It is however, in my opinion, the R₀ plot that provides us with the most interesting information. All parameters show a sharp decrease in the R₀ value around the month of April, which about a month after the beginning of state-wide social distancing protocols and generally coincides with the periods of peak isolation levels in the city. This heavily suggests the effectiveness of the social distancing measures adopted by the state and city governments, especially when considering that the R₀ drops to below 1.0, in which case the pandemic is guaranteed to contract as each coronavirus carrier infects less than one person throughout their infectious period. The initial behaviors of the R₀, depending on the optimization, may also provide some interesting hints at how the pandemic behaved in the early stages. Both Basin-Hopping and Dual Annealing parameters indicate that the pandemic began spreading very fast and, as individuals began to isolate and distance themselves, the velocity of the virus’ spread began decreasing, with a sharp drop when quarantine protocols were implemented. The Differential Evolution parameters, on the other hand, suggest that the pandemic began with R₀ already below 1.0, meaning it was still very controllable during early stages. However, it continued to grow until stabilizing shortly after the beginning of social distancing measures, which was also followed with a sharp decrease in the R₀ value.

Reopening and Going Back to Normal

There’s no doubt that everyone wants quarantine to be over and things to go back to normal, however, reopening the city and economy too early could greatly risk undoing all progress made by social distancing policies. It was under that light that I was prompted to analyze what kind of impact relaxing quarantine protocols at various levels would have, being greatly inspired by (and largely a mirror of) the one performed by Jones and Fernándes-Villaverde [8]. Once again, I’m no government institution and this should be taken with a large pinch of salt, I could very well be wrong!

Some people here have been very vocal about how they want quarantine to be immediately lifted, no buts! It seems reasonable to assume the model parameters in such a case would (in the best case scenario) be the same as the ones we have already found, with the exception that β(t) = β₀ (1 — (1 + m/2 cos(2 π t / 365))), which removes the term supposed to represent social distancing effects. Well, let’s see what the Seasonal Kinetic SIRD model predicts would happen in such a case.

Model forecast of active infections and mortalities assuming complete suspension of social distancing protocols along with estimates for the total number of infections and deaths until the end of 2020

Yikes! Yeah, that probably isn’t a good idea. We see that not only does the infection and death toll skyrocket, but the pandemic also extends itself well into 2021 in some scenarios. Instead, let’s do this the smart way.

Our main point of interest then becomes the herd immunity, roughly speaking the concept that the more people that get infected and recover, the fewer people the virus can infect, which causes a reduction in how many new people are infected. To get a better grasp of how herd immunity impacts the virus’ growth, I personally find it helpful to view the infection process as a combination of random variables, even if you don’t have any experience with statistics.

Let X(t) be the (random) number of people an infected individual transmits the virus to throughout their infectious period at t. It follows from the definition of R₀ that E[X(t)] = R₀(t), where E[⋅] is the expected value. However, the person who gets the virus only becomes infected if they are in the susceptible population, which is true with probability S(t) / N(t). Thus, letting Y(t) be a Bernoulli (binary) random variable such that P(Y(t) = 1) = S(t) / N(t) and P(Y(t) = 0) = 1 — S(t) / N(t), the average number of people an infected person actually infects at t is given by

Effective reproductive number definition

which is known as the effective reproductive number. We want to see how much we can relax quarantine, which inevitably increases the R₀, but still keep the Rₑ below 1 so that the virus is shrinking, which can be achieved if R₀(t) < N(t) / S(t).

We assume that the R₀ reaches its maximum value when quarantine is fully relaxed and that it increases proportionally to increases in activity. Then, letting R’₀ be the current R₀ as of May 26th, we get that the maximum possible percent activity increase is given by

Definition of maximum possible activity increase while still suppressing the virus

The plots of this maximum increase for the different fitted parameters are given below.

As we can see by the plots, it definitely may not be all bad news for Rio! It seems as if according to Dual Annealing, which gives the most optimistic parameters infections wise, we’d be able to return just shy of 15% back to normal right now. On the other hand, Differential Evolution, which predicts more infections and thus more immune individuals, suggest that right now we’d be able to get more somewhere 20% and 25% back to normal. These estimates continue to improve the longer we look into the future, however, none of them come even close to suggesting a 100% return to normal is possible until there are no more active infected individuals.

Additionally, let’s consider a model where the recovery and mortality rates continue to behave as determined, but authorities are able to control the infection rates so that the R₀ is kept within a percentage margin δ of that necessary for virus suppression. Note, I have no idea how such infection rates could be achieved, after all I’m just a Mathematics student, not a policy maker. Below, the plots of the predicted pandemic end dates, total infections and total mortalities according to the containment margin δ are shown below

Estimates of pandemic end dates, total infections and mortalities according to percent margin kept away from maximum R₀ for virus suppression

These plots show us the “no free lunch” nature of dealing with a pandemic. The more we reduce δ, i.e. the more we reopen and get closer to allowing the virus to start growing, the more infections and deaths we’ll have, with the pandemic itself stretching into 2021 for values below 20%.

What approach to take, and how to effectively execute it, is the responsibility of the Rio de Janeiro authorities. Being born and raised carioca, Rio is somewhere I keep close to heart at all times and seeing it be ravaged by COVID-19 has been a true tragedy. Amidst the current waves of misinformation, analyzing the information here presented has personally helped me get a clearer picture of what’s going on and how coronavirus might behave on a city level, and I hope the same has been true for yourself.

References

[1] J.H. Jones, 2007. “Notes on R₀”. Stanford University Department of Anthropological Sciences. https://web.stanford.edu/~jhj1/teachingdocs/Jones-on-R0.pdf

[2] Instituto Brasileiro de Geografia e Estatística, 2019. “Population Estimates:. https://www.ibge.gov.br/estatisticas/sociais/populacao/9103-estimativas-de-populacao.html?=&t=resultados

[3] L. Danon et al., 2020. “A spatial model of CoVID-19 transmission in England and Wales: early spread and peak timing”. https://doi.org/10.1101/2020.02.12.20022566

[4] D. Caccavo, 2020. “Chinese and Italian COVID-19 outbreaks can be correctly described by a modified SIRD model”. https://doi.org/10.1101/2020.03.19.20039388

[5] A. K. Qin, V. L. Huang and P. N. Suganthan, “Differential Evolution Algorithm With Strategy Adaptation for Global Numerical Optimization,” in IEEE Transactions on Evolutionary Computation, vol. 13, no. 2, pp. 398–417, April 2009, doi: 10.1109/TEVC.2008.927706.

[6] G. G. Rondina and J.L.F. Da Silva. “Revised basin-hopping Monte Carlo algorithm for structure optimization of clusters and nanoparticles.” Journal of chemical information and modeling 53.9 (2013): 2282–2298.

[7] C. Tsallis, and D. A. Stariolo. “Generalized simulated annealing.” Physica A: Statistical Mechanics and its Applications 233.1–2 (1996): 395–406.

[8] C.I. Jones and J. Fernándes-Villaverde, 2020. “Estimating and Simulating a SIRD Modelof COVID-19 for Many Countries, States, and Cities”.