## Abstract

The variant of concern (VOC) P.1 emerged in the Amazonas state (Brazil) and was sequenced for the first time on 6-Jan-2021 by the Japanese National Institute of Infectious Diseases. It contains a constellation of mutations, ten of them in the spike protein. Consequences of these mutations at the populational level have been little studied so far. From December-2020 to February-2021, Manaus was devastated by four times more cases compared to the previous peak (April-2020). Here, data from the national health surveillance of hospitalized individuals and frequency of the P.1 variant were analysed using a model-based approach to estimate P.1 parameters of transmissibility and reinfection by maximum likelihood. Sensitivity analysis was performed changing pathogenicity and the period analysed (including/excluding the health system collapse period). The new variant transmissibility was found to be about 2.5 times higher (Confidence Interval (95%CI): 2.3–2.8) compared to the previous variant in Manaus. A low probability of reinfection by the new variant (6.4%, 95%CI: 5.7–7.1%).) was estimated, even under initial high prevalence (68%, 95%CI: 63– 74%), by the time P.1 emerged. Consequences of a higher transmissibility were already observed with VOC B.1.1.7 in the UK and Europe. Urgent measures must be taken to control the spread of P.1.

## Introduction

The Japanese National Institute of Infectious Diseases identified the new P.1 SARS-CoV-2 variant from travelers returning from Amazonas State, Brazil on 6-Jan-2021 (1). P.1 was first reported in Manaus, the Amazonas state capital city, on 11-Jan-2021 (2). Later, it was identified in samples collected since 6-Dec-2020 from Manaus (3). According to phylogenetic studies, P.1 likely emerged in the Amazonas state in early (3) or late (4) November-2020. This variant contains a constellation of mutations, ten of them in the spike protein, and shares mutations with other variants of concern (VOCs) previously detected in the United Kingdom and South Africa (B.1.1.7 and B.1.351, respectively), including E484K, K417T, N501Y, and a deletion in the orf1b protein (del11288-11296 (3675-3677 SGF)) (2).

The Coronavirus disease 2019 (COVID-19) outbreak in Manaus (April-May 2020) was followed by a period of high but stable incidence. Then, from December 2020 to February 2021 the city was devastated by a new outbreak that caused a collapse in the health system and shortages of oxygen supply. (5). The frequency of P.1 increased sharply from 0% in November 2020 to 73% in January 2021 (4). The pathogenicity of the P.1 variant is still unknown, although recent studies point to increased viral load in individuals infected with the new variant (4). We analyzed national health surveillance data of hospitalizations and frequency of P.1 sequences from residents of Manaus city using a model-based approach to estimate P.1 transmissibility and reinfection.

## The study

### Dataset

We used the Brazilian epidemiological syndromic surveillance system for influenza, SIVEP-Gripe (publicly available at https://opendatasus.saude.gov.br), to track COVID-19 hospitalized cases. All hospitalized patients with Severe Acute Respiratory Infections are reported to SIVEP-Gripe with symptom onset date and SARS-CoV-2 test results. Data for hospitalized COVID-19 cases among residents in Manaus from 01-Nov-2020 to 31-Jan-2021 was obtained from SIVEP-Gripe database from 15-Feb-2021. The hospitalized cases of the last 10 weeks in the time series were nowcasted (6) to correct for notification delay. Time-series of frequency of sequenced genomes identified as P.1 in Manaus were extracted from published datasets (3,7).

### Model

A deterministic compartmental model (Figure 1) was developed to model the infection of *Susceptible* individuals moving to the *Exposed* (pre-symptomatic) compartment, which can progress to three alternative compartments: *Hospitalized* (severely ill), *Infected* (symptomatic but non-hospitalized), and *Asymptomatic*. Eventually, individuals move to *Recovered* or *Deceased*. Two variants are considered: 1-non-P.1 (“wild-type”) and 2-new/P.1. The latter is assumed to infect *Recovered* individuals previously infected by the wild-type, and no reinfections of wild-type due to waning immunity occur. Compartments were stratified into three age categories: young (<20 years old), adult (≥20 and <60 years old) and elderly (≥60 years old), with different rates for outcomes. The key parameters of *relative transmissibility* and *reinfection probability* were estimated by a maximum likelihood fitting to the weekly number of new hospitalizations and to genomic surveillance data (3,7). Three additional model parameters with unknown values were estimated. The remaining parameters (24 out of 29) were fixed, using current values from the literature (see Table A.I in the Appendix for values and references). Sensitivity to higher pathogenicity of the variant P.1 was explored by repeating the fit assuming higher infection hospitalization rate (SA1). The sensitivity to the period analysed was also explored by another fit excluding the health system collapse period (SA2). Further details of the model and fitting methodology are available in the Appendix.

## Results

The estimated transmissibility of P.1 was 2.5 (95% Confidence Interval (CI): 2.3–2.8) times higher compared to the wild-type variant, while the reinfection probability due to the new variant was 6.4% (CI:5.7–7.1%). The model fitting also estimated that, at the time the new P.1 variant emerged, the prevalence of previous infection to the wild-type variant was 68% (CI:63–74%), and that the number of cases by the wild-type variant were increasing with an estimated daily intrinsic growth rate of 0.053 days-1 (CI:0.047–0.058 days-1). See Table 1 and Figure 2 for fitted parameters results. The results were robust and confirmed by sensitivity analyses as shown in Table 2.

## Conclusions

COVID-19 hospitalizations and frequency of the P.1 variant in clinical samples showed a sharp increase in Manaus, Brazil, starting November 2020. The fitted model suggests this joint increase was the result of the emergence of P.1, estimated to be 2.5 times more transmissible than the wild-type variant. The spread of P.1 occurred despite an existing high estimated prevalence of infection by the wild-type virus and a small probability of reinfection by the invading variant. The pathogenicity of P.1 is still unknown, but even assuming higher infection hospitalization rates for the P.1 variant, the estimated transmissibility remained above twice that of the wild-type variant.

Two recent studies analysed genomic data of SARS-CoV-2 from Manaus evaluating the transmissibility of the new variant (3,4). Faria and collaborators integrated mortality and genomic data and, using a semi-mechanistic Bayesian model, estimated a transmissibility 1.4–2.2 times higher and 25–61% evasion of protective immunity related to the variant P.1 (3). Naveca and collaborators estimated a 2.2 times higher effective reproduction number for the P.1 variant using phylogenetic methods, and suggested that P.1 is at least two times more transmissible than the parental lineage, assuming reinfections are rare (4). The present work follows a different approach that can be defined as an epidemiological, model-based, and data-fitting approach. Notably, the three different approaches all estimated higher transmissibility of the P.1 variant.

Data from the Amazonian region are scarce. The existing prevalence data were obtained from convenience samples, and P.1 frequency over time was based on small sample sizes; the former was included in the parameters fitted, while the uncertainty of the latter was explicitly considered. Epidemiological data are probably incomplete, since the analysed data overlap with the period of the health system collapse, posing limitations to this study. This was minimized by removing the collapse period in the sensitivity analysis (SA2) and the results were still robust. Due to lack of data evaluating waning immunity, the present work does not include reinfection by the wild-type on recovered individuals. If waning immunity is an important aspect, the transmissibility might be lower than estimated in this study. However, a recent study considering this possibility still estimated a high transmissibility for the P.1 variant (3).

The threats of a highly transmissible variant have already been observed with VOC B.1.1.7 in the UK, USA and Europe (8). Higher transmissibility of the P.1 variant raises strong concerns of swift upsurges in the number of cases once P.1 reaches community transmission across Brazil and other countries. The P.1 variant has already been detected in at least 25 countries (8). This calls for urgent i) pathogenicity studies of the P.1 variant, since greater transmissibility and pathogenicity can drive even well-prepared health systems to collapse; ii) immunological studies to understand the role of waning immunity in emergence of variants after a long period of epidemic control; and finally iii) mitigation measures to control the spread of P.1.

## Data Availability

All data used are publicly available. The data sources are described in the manuscript and in supplementary file.

## Appendix

In order to estimate key parameters of the variant of concern (VOC) P.1, we developed a model and fitted it to time-series data on the number of hospitalized cases and frequency of the P.1 variant. Section I describes the model, section II relates the values of the parameters taken from the current literature, section III describes the contact matrix used, and finally section IV describes the treatment of case data (subsec. IV-A), the choice of initial conditions (subsec. IV-B]), the fitting procedure (subsec. IV-C) and the sensitivity analysis evaluated regarding pathogenicity and data period analysed (subsec. IV-D).

### I. Model Equations

The model is an extended Susceptible, Exposed, Infected, and Recovered (*SEIR*) model that comprises susceptible (S), pre-symptomatic (E), asymptomatic (A), mild symptomatic (I), severe/hospitalized (H), recovered (R) and deceased (D) compartments. These compartments are duplicated to account for a second variant of SARS-CoV-2, and each of them is stratified into three age classes: young (< 20 years), adults (20-59 years), and the elderly (≥ 60 years). The “wild-type” classes represent all non-P.1 variantspresent, which do not seem to be variants of concern.

We assume that the second variant is capable of reinfecting individuals who have recovered from infection by the wild-type variant while the inverse is not possible; in the absence of data indicating this possibility, allowing reinfection by the wild-type variant on recovered of infection by P.1 would have negligible effect due to the small time window (3 months) considered in the present work. We also consider that a variant is not capable of reinfecting individuals recovered from the same lineage. Our model does not include vaccination due to low rates of vaccination in Brazil during the study time period.

To model the virus spread in the population, we assume that asymptomatic individuals have equal infectiousness compared to symptomatic ones, while pre-symptomatic individuals have reduced infectiousness represented by ω. To model behaviour, we assume that symptomatic individuals self-isolate themselves to some degree, reducing their contacts by *ξ*. Individuals with severe disease have greater isolation *ξ*_{sev} due to hospitalization. The daily contacts between each age class is represented by the matrix *Ĉ*. The force of infection *λ*_{k} for each variant *k* is defined below:

The complete system of equations is given by:
where *C*_{1} ad *C*_{2} are the cumulative hospitalization cases reported, and each variable of the system (*S, E*_{k}, …, *C*_{k}) is a vector containing each age class, e.g., *E*_{1} = (*E*_{1,y}, *E*_{1,a}, *E*_{1,e})^{T}. The equations were numerically solved by the package developed by Soetaert et al (9).

### II. Parameterization of the model

Table A.I presents parameters considered for the wild-type variant. The parameters for the P.1 variant are the same except for those considered in the model fitting.

### III. Contact Matrices

Our model includes three age group categories: young ([0 − 19]*y*. *o*.), adults ([20 − 59]*y*. *o*.) and elderly (greater than 60*y*. *o*.). To model contacts between these groups we use estimated contact matrices computed by Prem et al (17), but since the original matrices use five-year age bins going up to 95+ years, we aggregate classes leading to a 3 × 3 matrix in the following way:

Let *A, B* be sets of indexes forming age groups (not necessarily of equal sizes), *x*_{i,j} denoting contact between age groups *i* and *j* in the original matrix, *d*_{i} denoting population size of the age group *i*. The new contact matrix *Ĉ* is given by:
where *A**, *B** denotes a new indexation rule. Note that the contact matrices depend on local demographics and therefore must be computed for each place of study.

### IV. Data Analysis Procedures

#### A. Nowcasting

Data used in parameters estimation were collected from the national public health system of severe acute respiratory illness (SARI) surveillance database, named *Sistema de Vigilância da Gripe - SIVEP-Gripe*. In this system, reporting of cases can be delayed for several reasons, including the notification system itself and confirmation of RT-PCR test results. The nowcasting procedure estimates, based on the past delay distribution, the number of cases that already occurred but were not yet reported. A window of 10 weeks is the acting window on the series, since delays greater than this are rare.

Nowcasting requires a pair of dates: (i) onset date of the event and (ii) report date of the event. The delay distribution is modeled as being best described as a Poisson distribution for days since the onset date to the report date. We considered *the first symptoms date* as the onset date. For the report date, we used the latest between *the test result date* and *the clinical classification date*. The nowcasting algorithm were developed by McGough et al 2020 (6), and implemented in the *NobBS* (Nowcasting by Bayesian Smoothing) package in.

#### B. Initial Condition Estimation

The model requires appropriate mid-epidemic initial conditions in order to give relevant results. In the model, the number of new hospitalizations at a given time – *h*_{new}, is directly proportional to the number of exposed individuals at that time, therefore data was used to get an approximation of the number of exposed people. Also, to quantify the number of people belonging to the recovered class, prevalence was used.

We can estimate the appropriate initial conditions by finding an approximation for our model that relates more directly to the available data in each class. In the absence of the variant P.1, the model has four classes of infected compartments, namely *y* = (*E*_{1}, *A*_{1}, *I*_{1}, *H*_{1})^{T}, and another three classes, represented by *z*, i.e., *z* = (*S, R*_{1}, *D*_{1})^{T}.

To that effect, we can write the system as:
where *F* comprises all entries of new Infected, coming from classes *z*, whilst *G* accounts for the transitions within infected classes and also recovery and death from the disease. Then, to find a good approximation for a small time window, we perform a linearization of our model around a point (*y, z*). Keeping *z* fixed, we get
where and *Ĝ* are the linearized matrices arising from the functions *F* and *G*, respectively. The only entrance of new infected comes from the *βSλ*/*N* terms in the *Ė*_{1} = (*Ė* _{1,y}, *Ė*_{1,a}, *Ė* _{1,e})^{T} equations (sub-indexes are *y* young, *a* adults and *e* elderly), then, the only non-zero elements of are in its first 3 lines. Before proceeding, it’s useful to define
which allow us to write
*Ĝ* contains the terms of Exposed, *E*_{1}, the 3 possible forms of the disease considered in the model, that is *A*_{1}, *I*_{1} and *H*_{1}, as the terms in its first 3 rows, whilst the remainder of its main diagonal contains terms of recovery and death. For simplicity, every constant (or vector for the terms with *σ*) in *Ĝ* expression (A1) should be thought as diagonal matrices with its elements given by the constants (or vectors) and every 𝕆 is a 3-dimensional square matrix where all entries are null.

The linearization above implies that, for a small time interval, *y* has an exponential behavior and that the eigenvalues of are related to the exponential growth rates. Therefore, a short time after the beginning of the epidemic, the largest eigenvalue should be the one to dominate. So the exponential growth rate of the wild-type variant – *r*, can be matched to the largest eigenvalue of to obtain an estimate for *β*. The eigenvector associated with the largest eigenvalue gives the proportions of infected classes, which, together with the estimated number of exposed individuals , results in an approximation for the number of people in the other infected classes.

Given a *β*, the largest eigenvalue of the linearization matrix is computed using the eigs function of the package *rARPACK* (18) and we find the *β* that gives *r* as the largest eigenvalue through bisection root finding. Finally, subtracting the number of recovered and infected from the total population gives the number of susceptible individuals.

#### C. Likelihood Estimation

Given the cumulative daily curves of hospitalization for wild-type variant, *C*_{1}, and P.1 variant, *C*_{2}, we can obtain the daily variation of each curve, namely and . Those curves are summed up to give the total number of weekly new cases:
where *τ* is a discrete index given in weeks.

To calculate the frequency of P.1 in a given time period *T*, we use the proportion of new cases in this period from the wild-type and P.1 variant as follows:
where *t* ′ is a discrete index given in *T* periods.

The time period *T* depends on the dataset of genome sequences: it is daily in Faria et al (3) and monthly in Fiocruz website (7).

Using maximum likelihood, we fitted the model by estimating five parameters, namely, the relative transmissibility (*Δβ*), the reinfection probability of P.1 (*p*), initial total prevalence (*ρ*^{0} = [*R*/*N*]_{t=0}), initial fraction of cases that were caused by the new variant (*P*^{0}), and intrinsic growth rate of the wild-type variant (*r*). The parameter *r* incorporates effects related to contact rates for the wild-type variant, such as non-pharmacological interventions relaxation, elections, and others.

Number of hospitalization cases were assumed to follow a Poisson distribution, with expected value given by equation (A2). The recorded number of P.1 in genome samples was assumed to follow a binomial distribution with an expected value equal to the product of the total number of genome sequences sampled in each date and the proportion of P.1 cases (equation (A3)). The log-likelihood function for the model fitting was then:
where *Pois* is a Poisson distribution with parameter *λ, x*_{i} is the number of recorded hospitalizations in week *i, Bin* is a Binomial distribution with parameters *N* (total number of trials) and π^{j} (probability of success at each trial), *n*_{j} is total number of sequences in clinical samples in week or day *j, y*_{j} is the number of P.1 sequences in each of these samples, and *θ*(.) is the *logit* function.

The model was then fitted by finding the values of the five above mentioned parameters that maximize the log-likelihood function (equation (A4)), using the function mle2, from the package *bbmle* (19).

To find starting values for the optimization performed by mle2 we calculated the log-likelihood function for a sample of one million combinations of parameters values within reasonable ranges. The sets of parameters values that provided the two higher values of the log-likelihood function were then used as starting values for the computational optimization.

The confidence intervals for the expected number of cases and frequency were estimated from 10000 parametric bootstrap samples assuming that the estimated parameters follow a multivariate normal distribution. The parameters of these multivariate distributions were the estimated values and estimated variance-covariance matrix of the parameters. For each sampled combination of parameters, the expected values were calculated and the confidence interval was estimated as the 2.5% and 95% quantiles of the distribution of bootstrapped expected values.

#### D. Sensitivity analysis

The model fitting assumed a constant Infection Hospitalization Rate (IHR, parameter *σ*) for each age group over time for both variants. An increase in IHR caused by P.1 would increase the number of hospitalizations even without any increase in transmissibility or reinfection. Because the pathogenicity of the variant P.1 is unknown, the model fitting was repeated assuming that the odds ratio of the IHR in each age class was twice for P.1 cases compared to wild-type variant cases (SA1). Moreover, as the collapse of Manaus health system hindered hospitalizations of new severe cases and may have affected case recording in surveillance databases, the model fitting was repeated considering only the period prior to the collapse (10 January 2021) (SA2).

## Acknowledgments

We are grateful for the collaborative work of the entire group of the Observatório COVID-19 BR, which has been tirelessly studying the situation of the SARS-CoV-2 epidemic in Brazil. The authors also thank the research funding agencies: the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (Finance Code 001 to FMDM, LSF and TPP), Conselho Nacional de Desenvolvimento Científico e Tecnológico – Brazil (grant number: 315854/2020-0 to MEB, 141698/2018-7 to RLPS, 313055/2020-3 to PIP, 312559/2020-8 to MASMV, 311832/2017-2 to RAK, 305703/2019-6 to AAMS) and Fundação de Amparo à Pesquisa do Estado de São Paulo-Brazil (grant number: 2019/26310-2 and 2017/26770-8 to CF, 2018/26512-1 to OC, 2018/24037-4 to SPL and contract number: 2016/01343-7 to RAK).