HIV-1 founder variant multiplicity is determined by the infection stage of the source partner

During sexual transmission, the large genetic diversity of HIV-1 within an individual is frequently reduced to one founder variant that initiates infection [1]. Understanding the drivers of this bottleneck is crucial to develop effective infection control strategies [2]. Genetic characteristics of the potential founder viruses and events in the recipient partner are both known to contribute to this bottleneck, but little is understood about the importance of the source partner [3]. To test the hypothesis that the source partner affects the multiplicity of HIV founder variants, we developed a phylodynamic model calibrated using genetic and epidemiological data on all existing transmission pairs for whom the direction of transmission and the infection stage of the source partner are known. Our results demonstrate the importance of infection stage of the source partner, and not exposure route, in determining founder variant multiplicity. Specifically, acquiring infection from someone in the acute (early) stage of infection increases the risk of multiple variant transmission when compared with someone in the chronic (later) stage of infection. This study provides the first direct test of source partner characteristics to explain the low frequency of multiple founder strain infections and can inform clinical intervention study design and interpretation.

medRxiv preprint doi: https://doi.org/10.1101/19013524; this version posted December 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

[...] study finds a median of one founder variant and a maximum of 11, with little difference between HET and MSM risk groups. When only multiple variant transmissions are considered, our study finds a median of 2-3 founder variants. These values are consistent with a previous pooled analysis using results from four analyses that used the current gold-standard SGA combination approach as above [9].

The factors leading to the diversity bottleneck during sexual transmission can be broadly categorized as those determined by the source partner (such as viral load and viral diversity available for transmission), those determined by the recipient partner (such as target cell type and availability in the genital or rectal mucosa; e.g. [3,5,11]), and those connected with viral characteristics (such as glycosylation profiles and cell tropism; reviewed in [12]). While the impact of the recipient partner and the characteristics of transmitted variants have been widely discussed, little is known about how the source partner affects the viral diversity bottleneck. In particular, despite the importance of infection stage (that is, the length of time between the source partner becoming infected and transmission to their partner) as a driver of HIV transmission, there is no empirical evidence to suggest how this influences the viral diversity bottleneck. This gap has arisen because analyses are routinely conducted on individuals without information on the partner from whom they acquired infection. Phylogenetic analyses now offer a possible solution to this impasse.

Phylogenetic trees are representations of the ancestral relationships of organisms, with the tips of the tree representing those that are sampled, the internal nodes their inferred common ancestors, and the branches the evolutionary pathways between these actual and inferred individuals. When phylogenetic trees are constructed using sequence data from both partners in an HIV transmission pair, the relationship between the evolutionary histories of both sets of viral samples may reflect epidemiological relationships between the two individuals [13-15]. Previous modelling studies suggest that by assigning the relationship between the evolutionary histories of both partners to one of three topology classes, namely monophyletic-monophyletic (MM), paraphyletic-monophyletic (PM), or a combination of paraphyletic and polyphyletic (PP), epidemiological information can be inferred, such as the direction of transmission [13], as well as evolutionary information, such as the number of transmitted variants [16]. That is, the number of monophyletic clusters in a PM (one) or PP (more than one) tree can be interpreted as the minimum number of transmitted lineages (Fig. 1A). In practice, however, many factors may influence epidemiological interpretations from phylogenetic trees, such as sampling times, sampling density of the viral populations and phylogenetic signal [17,18].
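As a minimal sketch of the topology classes described above, assume a rooted tree stored as nested Python tuples, with tip names prefixed `S` (source) or `R` (recipient). The code below classifies such a tree as MM, PM or PP and counts the maximal recipient-only clades. The tuple representation, the function names and the simplification that any tree with a non-monophyletic recipient is called PP are illustrative assumptions, not the classification code used in this study.

```python
def tips(node):
    """Return the list of tip names below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(tips(child))
    return found


def is_monophyletic(tree, group):
    """True if some clade of the rooted tree contains exactly the tips in `group`."""
    target = set(group)

    def visit(node):
        if set(tips(node)) == target:
            return True
        if isinstance(node, str):
            return False
        return any(visit(child) for child in node)

    return visit(tree)


def count_recipient_clades(node):
    """Count maximal clades made up only of recipient ('R') tips."""
    if all(name.startswith("R") for name in tips(node)):
        return 1
    if isinstance(node, str):
        return 0
    return sum(count_recipient_clades(child) for child in node)


def classify_topology(tree):
    """Assumed rule: MM if both hosts are monophyletic, PM if only the
    recipient is monophyletic, PP otherwise."""
    names = tips(tree)
    source = [n for n in names if n.startswith("S")]
    recipient = [n for n in names if n.startswith("R")]
    recipient_mono = is_monophyletic(tree, recipient)
    source_mono = is_monophyletic(tree, source)
    if recipient_mono and source_mono:
        return "MM"
    if recipient_mono:
        return "PM"
    return "PP"


# Hypothetical toy trees, one per topology class.
mm_tree = (("S1", "S2"), ("R1", "R2"))
pm_tree = ("S1", ("S2", ("R1", "R2")))
pp_tree = (("S1", ("R1", "R2")), ("S2", "R3"))

for tree in (mm_tree, pm_tree, pp_tree):
    print(classify_topology(tree), count_recipient_clades(tree))
```

In the PP toy tree the recipient tips fall into two separate clades nested among source tips, which under the interpretation above would indicate at least two transmitted lineages.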

Here we present a data-driven phylodynamic approach to overcome these empirical and methodological issues to evaluate the impact of the source partner's infection stage and route of exposure on the HIV diversity bottleneck (Fig. 1B,C). We first retrieved all available genetic and epidemiological information from published HIV sexual transmission pairs where the direction of transmission is known, and kept for further analysis those pairs for whom transmission could be classified as having occurred in the source partner's acute stage (≤90 days after his/her infection) or chronic stage (later than 90 days after his/her infection). After further stratifying pairs into heterosexual (HET) and men-who-have-sex-with-men (MSM) risk groups, we found a significant difference in the timing of transmission between the two risk groups. Specifically, 10 of 36 MSM pairs were the result of acute stage transmission compared with 1 of 76 HET pairs (Fig. 2).

We then performed Bayesian phylogenetic tree reconstruction on the genetic sequences of the transmission pairs and classified the topology class of each tree in the posterior distribution as monophyletic-monophyletic (MM), paraphyletic-monophyletic (PM) or paraphyletic-polyphyletic (PP). The most likely topology class was PM (65% and 61% for HET and MSM, respectively), but with a higher number of PP trees in the MSM group (P=0.056, Fig. 2). This result has previously been reported as indicative of a higher number of founder variants for MSM [16]. However, when we stratify the topology class by whether the source partner was in acute or chronic infection at the time of transmission, our results indicate that the infection stage of the source is the primary driver for any observed differences in topology class. Specifically, there is no difference between the HET and MSM groups in the PM/PP topology class ratio when transmission occurs in the chronic stage of infection (P=0.570). Note that only one HET transmission occurs during the acute stage, and the topology class for this pair is PP. These results remain qualitatively consistent when we analysed only data from the 66% of transmission pairs for whom the posterior trees gave a certainty of over 95% for the most frequent topology class (Fig. S3). These results indicate that infection stage of the source partner, and not risk group per se, influences the diversity bottleneck at transmission.
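The risk-group difference in the timing of transmission reported above (10 of 36 MSM pairs versus 1 of 76 HET pairs with an acute stage source) can be checked with a short calculation. The text does not state which test was used; the sketch below assumes a two-sided Fisher's exact test, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def weight(x):
        # Unnormalised probability of seeing x in the top-left cell (exact integer).
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(col1, row1)
    observed = weight(a)
    extreme = sum(weight(x) for x in range(lowest, highest + 1)
                  if weight(x) <= observed)
    return extreme / total


# Rows: MSM, HET. Columns: acute stage source, chronic stage source.
p_value = fisher_exact_two_sided(10, 26, 1, 75)
print(f"two-sided P = {p_value:.2g}")
```

Under this assumed test the P value is well below 0.001, in line with the statement that the timing of transmission differs between the two risk groups.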

To test whether these empirical findings are indicative of a smaller diversity bottleneck in the chronic stage of HIV infection, we developed a phylodynamic framework in which we simulated the epidemiologic characteristics of each HET and MSM transmission pair, the timing of their sequence sampling, the transmission of virus particles, and the within-host genetic evolution in both the source and recipient (Fig. 1B). Specifically, using the epidemiological information from the transmission pairs, we simulated phylogenies under a coalescent model before generating genetic sequences from these simulations and performing Maximum Likelihood (ML) phylogenetic reconstruction on these simulated sequences. We classified each of these simulated trees as MM, PM or PP and determined the frequency of each topology class for each simulated transmission pair across all the simulated sequences. However, as we could not directly observe the number of virus particles that are transmitted between source and [...] to the topology class distribution from the empirical phylogenetic trees using maximum likelihood inference, we then determined the most likely number of transmitted virus particles for each transmission pair and used this best fit model for further analysis. Note that two or more virus particles may have the same genetic sequence, and would constitute a single founder variant (or haplotype), discussed later.
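The fitting step described above can be sketched as follows. This is a minimal illustration, assuming a multinomial likelihood over the three topology classes, with the empirical posterior trees as observed counts and the simulated class frequencies for each candidate number of founding particles as probabilities. Confidence limits are taken as the candidates whose log-likelihood lies within 1.92 units of the maximum. The function names, the small floor `eps` used to avoid log(0) and all numbers in the example are hypothetical.

```python
import math

CLASSES = ("MM", "PM", "PP")


def log_likelihood(observed_counts, simulated_freqs, eps=1e-9):
    """Multinomial log-likelihood up to an additive constant.

    The multinomial coefficient is omitted because it does not depend on
    the candidate number of founding particles.
    """
    return sum(observed_counts[c] * math.log(max(simulated_freqs[c], eps))
               for c in CLASSES)


def fit_founding_particles(observed_counts, simulated_by_n, threshold=1.92):
    """Return (best n, lower limit, upper limit) for one transmission pair."""
    scores = {n: log_likelihood(observed_counts, freqs)
              for n, freqs in simulated_by_n.items()}
    best = max(scores, key=scores.get)
    supported = [n for n, score in scores.items()
                 if score > scores[best] - threshold]
    return best, min(supported), max(supported)


# Hypothetical simulated topology class frequencies for n = 1, 2, 3 particles.
simulated = {
    1: {"MM": 0.05, "PM": 0.90, "PP": 0.05},
    2: {"MM": 0.02, "PM": 0.40, "PP": 0.58},
    3: {"MM": 0.01, "PM": 0.20, "PP": 0.79},
}

# Hypothetical observed topology class counts for two pairs.
mostly_pm = {"MM": 50, "PM": 4900, "PP": 50}
mostly_pp = {"MM": 0, "PM": 2, "PP": 8}

print(fit_founding_particles(mostly_pm, simulated))
print(fit_founding_particles(mostly_pp, simulated))
```

With many observed trees dominated by one class the supported range collapses to a single value, whereas with few or mixed observations several multiplicities remain within the threshold.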

Our fitting procedure selects a best fit model that clearly distinguishes transmission pairs between whom one virus particle is transmitted (75% of pairs) from those between whom more than one virus particle is transmitted (25% of pairs, Fig. 3A). While there is a high degree of confidence in the result when one particle is transmitted, there is often uncertainty around the exact number when multiple particles are transmitted (Fig. 3A). Importantly, we found acute stage transmissions are less likely to lead to single particle infections compared with chronic stage transmissions (27% vs. 80%, P = 0.0005). The topology class of the simulated phylogenetic trees is strongly influenced by the number of virus particles being transmitted (Fig. 3B). PM trees are more commonly found in the pairs that are better described by a model with a single transmitted virus particle (81%), whereas PP trees appeared more often when multiple particles are likely to have been transmitted (86%).

Stratifying by risk group, we find there is a higher probability that one variant founds HET infections than MSM infections (a geometric mean of 0.80 vs. 0.63, Fig. 4B). However, these risk group differences mostly disappear when we stratify the results by the infection stage of the source. For example, when only chronic stage transmissions are considered, the probability of one founding variant is a little higher for MSM transmissions than for HET transmissions (means of 0.80 vs. 0.71), and the pairwise diversity at transmission is similar between both groups (Fig. 4C). In contrast, when stratifying solely by infection stage of the source partner, we find that transmission during the acute stage has a much lower probability of one founder variant than during the chronic stage (means of 0.40 vs. 0.77), with a higher median number of variants transmitted when only the most likely multiplicity for each pair is considered (Fig. 4A). Nonetheless, if multiple variant transmission does occur, our results suggest that the number of founder variants is higher during chronic stage transmission, consistent with a higher diversity measure during this later stage of infection (Fig. 4C).

Our results suggest that there is an association between tree topology class and multiple variant transmission, with 95% of MM and PM trees being due to one founder variant (Fig. 4D). However, the number of embedded recipient clades is not always a proxy for the minimum number of founder variants transmitted. For example, in chronic stage transmission, 11% of PP topology class trees were due to single variant transmission (Fig. 4D). Across both infection stages, we find that if MM, PM or PP is assigned as the most likely tree topology class, then 92%, 96% and 15% of transmissions are due to a single founder variant, respectively.

[...] with our results, with two recipients likely infected with one founder variant, one recipient with one to two founder variants and one recipient with two to three variants. Small differences likely arise because this study uses sequence data from both partners to evaluate the multiplicity of founder variants in the recipient partner. These extra data can be used to parameterize a mathematical model that accounts for the evolutionary relationship between the virus samples from both partners, rather than relying solely on accumulating diversity. Specifically, neglecting the extent of genetic similarity between the source and recipient virus samples might misattribute borderline cases of diversity accumulation.

Data collation on linked transmission pairs
We automatically retrieved all HIV sequence data for men-who-have-sex-with-men (MSM) and [...] (Fig. 1, Supplementary Information). We excluded from further analysis all transmission pairs for whom these three times could not be determined, for whom a risk group was not provided, for whom either partner had fewer than five sequences for all sampling times, or for whom the pair did not have a LANL cluster ID. For our base case analysis, we used the longest available genomic region with five or more sequences per partner. If more than one sampling time was available for any of the individuals, we selected the sample closest in time to the recipient transmission.

Empirical transmission pair analysis
Tree reconstruction: For each of the included transmission pairs, we generated posterior sets of phylogenetic trees. For this, we first constructed alignments using Muscle v3.8.31 [22,23] with subtype-specific reference sequences retrieved from the LANL HIV sequence database. Using these alignments, we built phylogenetic trees with MrBayes 3.2.7 [24,25] under the assumption of a general time-reversible (GTR) nucleotide substitution model with the addition of invariant sites (I) and a gamma distribution of site rates. We constrained sequence data to be monophyletic with respect to the reference sequences to root the tree, but ingroup relationships were unconstrained to avoid any topology class bias. We ran two Markov chains, each with 30 million iterations, from which we sampled every 3,000th iteration after discarding the first 50% as burn-in, which provided an average standard deviation of split frequencies of below 0.01 or an effective sample size of greater than 300. This gave an empirical posterior distribution of N = 5,000 sample trees. In a sensitivity analysis, we tested the alternative method of using maximum likelihood phylogenetic tree reconstruction with bootstrapping.

[...] accounted for the assumption that transmission rate may be higher during the acute stage, with half of the index to source transmissions occurring after 90 days and the remaining half after three years, (ii) the source individual of the transmission pair, and (iii) finally the recipient individual of the transmission pair. For each individual within each trio, we simulated viral phylogenies that reflect between- and within-host viral evolution using VirusTreeSimulator [26], using as input the respective epidemiological and clinical information (Supplementary Information). We used a within-host effective population size consistent with that parameterized by the PANGEA-HIV study, with the following logistic model parameters: initial effective population size ($N_0$) is 1, viral generation time ($\tau$) is 1.8 days, effective population per year growth rate ($r$) is 2.85022, and time to half the carrying capacity of the viral population ($t_{50}$) is 2 years [26]. For each transmission pair, we simulated a dated viral phylogeny that has the same number of tips as the number of retrieved sequences per partner and that is sampled at the respective sampling times for the source and recipient partner (Supplementary Information). For each recipient partner infection, we assume that a total of $n_R$ virus particles founded the infection. For each simulation, we further assume a total of $n_S$ virus particles founding infection of the source. We assume $n_R$ takes values between one and a maximum of 12 and varied $n_S$ between one and two (Supplementary Information). We assume that the virus samples from each recipient are representative of the within-host diversity, and that each founding virus particle has an extant lineage. Therefore, we first assigned each sample (tip) of a phylogeny as a descendant of one of the $n_R$ virus particles. If there were more than 12 samples, then the remaining tips were assigned randomly to the $n_R = 12$ virus particles. If there were fewer than 12 samples, then we constrained the number of founding virus particles, $n_R$, to equal the number of samples. For every transmission pair, and for each value of $n_R$ and $n_S$, we simulated 100 viral phylogenies.

For every simulated viral phylogeny, we simulated transmitted sequences by adding dummy nodes with a negligibly short branch length after the transmission time. We then simulated the evolution of nucleotide sequences along the tree using Seq-Gen [27] and a GTR + I + gamma substitution model. The length of the simulated sequences and the evolutionary tree scaling rate match each transmission pair's empirical [...]

[...] pair, $k$, we estimated the most likely number of viral particles founding each recipient infection, $n_R^*$, as the $n_R$ that maximises the multinomial likelihood function [...]. For each transmission pair $k$, we calculated lower and upper confidence limits for $n_R^*$ as the minimum and maximum values of $n_R$ that satisfy $\ell_{k,n_R} > \ell_{k,n_R^*} - 1.92$ and $\ell_{k,n_R} < \ell_{k,n_R^*} + 1.92$, respectively [31]. For each transmission pair $k$, we retain the best fit model for further analysis such that there are $n_R^*$ viral particles founding infection of the recipient.
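To make the logistic within-host population parameters quoted above concrete, the sketch below evaluates a standard logistic curve with those values. The functional form N(t) = K / (1 + exp(-r (t - t50))), with the carrying capacity K fixed by requiring N(0) = N0, is an assumption; the text gives only the parameter values, and the function names are hypothetical. The generation time of 1.8 days is not used in this sketch.

```python
import math

# Parameter values quoted in the text (time in years).
N0 = 1.0        # initial effective population size
R = 2.85022     # growth rate per year
T50 = 2.0       # time to half the carrying capacity


def carrying_capacity(n0=N0, r=R, t50=T50):
    """Carrying capacity implied by N(0) = n0 under the assumed logistic form."""
    return n0 * (1.0 + math.exp(r * t50))


def effective_population_size(t, n0=N0, r=R, t50=T50):
    """Assumed logistic effective population size at time t since infection."""
    k = carrying_capacity(n0, r, t50)
    return k / (1.0 + math.exp(-r * (t - t50)))


for years in (0.0, 0.25, 1.0, 2.0, 5.0):
    print(f"t = {years:4.2f} y  N = {effective_population_size(years):7.2f}")
```

Under this assumed form the quoted parameters imply a carrying capacity of approximately 300, since exp(2.85022 × 2) is approximately 299, so the population is still small at 90 days and reaches half its plateau at two years.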

Haplotype analysis
[...] model, we defined the random variables $H_S$ and $H_R$ as the number of haplotypes that found infection of the source and the recipient partners, respectively. We then calculated the probability of there being a single founder haplotype in the recipient, stratified by topology class of the simulated phylogenetic tree, $t$ (MM, PM, PP), and the number of founder haplotypes, $i$, in the source partner, $p_i(t)$, that is, $p_i(t) = \Pr(H_R = 1 \mid H_S = i, t)$. Next, we defined the probability of a single founder haplotype in the recipient as a function of tree topology, $t$: $p(t) = \Pr(H_R = 1 \mid t) = p_1(t) \Pr(H_S = 1) + p_{>1}(t) \Pr(H_S > 1)$. By assuming that the source partners are randomly selected from the general MSM or HET population, in which the probability of a single founder strain has been calculated to be approximately 0.7 [19], we set $\Pr(H_S = 1) = 0.7$ and $\Pr(H_S > 1) = 0.3$. Finally, for each transmission pair, we calculated the probability of one founder haplotype given the observed triplet of empirical posterior topology classes [...]
