Validation of molecular clock inferred HIV infection ages: Evidence for accurate estimation of infection dates.

BACKGROUND
Improving HIV diagnosis, access to care and effective antiretroviral treatment provides our global strategy to reduce HIV incidence. To reach this goal we need to increase our knowledge about local epidemics. HIV infection dates would be an important information towards this goal, but they are largely unknown. To date, methods to estimate the dates of HIV infection are based mainly on laboratory or molecular methods. Our aim was to validate molecular clock inferred infection dates that were estimated by analysing sequences from 145 people living with HIV (PLHIV) with known transmission dates (clinically estimated infection dates).


METHODS
All HIV sequences were obtained by Sanger sequencing and were previously found to belong to well-established molecular transmission clusters (MTCs).


RESULTS
Our analysis showed that the molecular clock inferred infection dates were correlated with the clinically estimated ones (Spearman's Correlation coefficient = 0.93, p < 0.001) and that there was an agreement between them (Lin's concordance correlation coefficient = 0.92, p < 0.001). For most cases (61.4%), the molecular clock inferred preceded the clinically estimated infection dates. The median difference between clinically and molecularly estimated dates of infection was of 0.18 (IQR: -0.21, 0.89) years. The lowest differences were identified in people who inject drugs of our study population.


CONCLUSIONS
The estimated time to more recent common ancestor (tMRCA) of nodes within clusters provides a reliable approximation of HIV infections for PLHIV infected within MTCs. Next-generation sequencing data and molecular clock estimates based on heterochronous sequences provide, probably, more reliable methods for inferring infection dates. However, since these data are not available in most of the HIV clinical laboratories, our approach, under specific conditions, can provide a reliable estimation of HIV infection dates and can be used for HIV public health interventions.


Introduction
HIV infection is still expanding around the world. According to WHO/UNAIDS, the number of people living with HIV (PLHIV) globally was 38 million in 2019. In the absence of an effective vaccine, the goal to reduce HIV incidence is based on the 90-90-90 target for diagnosis, access to care and effective antiretroviral (ARV) treatment. Knowledge about local HIV epidemics would facilitate the implementation and success of this ambitious global target (1). One of the critical epidemiological parameters is HIV incidence, but HIV infection dates are largely unknown. This is due to the lack of seronegative tests for the majority of the PLHIV, as well as the rare diagnosis during the acute phase of infection.
To date, the available methods to estimate HIV infection are based on serological or genomic-based approaches. Laboratory (serological) methods, such as the Limiting Antigen Avidity Assay (LAg) or the Calypte Incidence Assay (BED), can classify recent from nonrecent infections (2-4). Genomic-based or molecular methods can estimate the age of HIV infection based on previous observations that HIV genetic divergence increases over time.
Several methods have been described so far including high resolution melting assays (HMA), the degree of sequence ambiguities in Sanger population sequencing (5-7), molecular clockbased methods (8,9) and the Hamming distances between different quasispecies (7, 10). More recently, the utilization of next-generation sequencing (NGS), and specifically the deep NGS, provided a new advance for the estimation of the time since HIV infection (11-13).
Our aim was to investigate if time to the most recent common ancestor (t MRCA ) of nodes within well-established molecular transmission clusters (MTCs) can be used as a proxy for the HIV infection date, using Sanger sequencing data from densely sampled populations with known transmission dates (clinically estimated infection dates).
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted December 11, 2020. ;https://doi.org/10.1101https://doi.org/10. /2020

Materials and Methods
Our study population consisted of PLHIV with known transmission dates from the following datasets: i) a "seek-test-treat" intervention programme named "ARISTOTLE" implemented in Athens, Greece, during 2012Greece, during -2013, ii) a network-based intervention that targeted persons with recent HIV-infection ("TRIP") implemented in Athens, Greece, between 2013 and2015 (15), iii) PLHIV infected within a subtype A1 MTC in Greece (16,17), iv) People who inject drugs (PWID) infected locally during an HIV outbreak in Luxembourg since 2014 and v) PLHIV infected within a CRF42_BF MTC in Luxembourg (18) ( Table 1). HIV sequences were selected to belong to PWID-specific MTCs [datasets (i), (ii) and (iv)] and local MTCs [datasets (iii) and (v)] (19). Local MTCs were defined as phylogenetic clusters including ≥ 2 sequences from the same geographic area (Greece or Luxembourg) at a proportion greater than 70% compared to the total number of sequences within the cluster, as previously described in detail (16,20). All sequences were generated before the ARV treatment initiation. Each patient was specified by a unique serial number to avoid double-counting. Duplicate patients were detected and excluded from the study population.
Known transmission dates were available from: a) the midpoint between the last available seronegative and the first seropositive tests for HIV, the possible date of HIV acquisition based on the diagnosis of acute HIV infection and patient self-report [datasets (i), (iii), (iv) and (v)] (17), or b) the putative date of HIV infection which was determined using the LAg method [dataset (ii)] (2, 15). From now on we will refer at these dates as "clinically estimated infection dates". Molecular clock calculation using phylodynamic analysis was used for the reestimation of the infection dates of our study population (PLHIV with an available clinically estimated infection date). Phylodynamic analyses were applied on PR and partial RT sequences . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted December 11, 2020. ;https://doi.org/10.1101https://doi.org/10. /2020 generated by Sanger sequencing and implemented in BEAST (versions 1.8.0 and 2.1.3), using different nucleotide substitution models (GTR+G, HKY+G), uncorrelated lognormal relaxed clock models and the birth-death models (21,22). The full description of these analyses has been provided in detail elsewhere (17,19,23). The age of infection (infection dates) was approximated to the median estimate of the t MRCA of the internal nodes within each MTC. We will refer at these dates as "molecular clock inferred infection dates".
Medians, interquartile ranges (IQRs) and 95% confidence intervals (CIs) were used to summarize the data. Spearman's correlation coefficient was used to measure the strength and direction of association between the clinically estimated and molecular clock inferred infection dates, while Lin's concordance correlation coefficient was used to measure the agreement between these estimates. The fitted regression line was estimated by a quartile (median) regression model and was compared to the line of equality by testing the two-tailed hypothesis of slope = 1 and intercept = 0. In addition, the Bland-Altman plot was used to quantify and describe the agreement between the two variables and the presence of outliers. Spearman's correlation coefficient was also used to assess putative correlation between the difference of the clinically and molecularly estimated dates of infection and the time interval from the last seronegative to the first seropositive tests for HIV, for a subset of our study population (26 PLHIV from whom the clinically estimated infection date was represented by the midpoint between these tests). The level of significance was set at 0.05. Analyses were performed in Stata 13-StataCorp LP software.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.11.20247601 doi: medRxiv preprint

Results
The total number of HIV sequences used in the context of our study was 145. These sequences were retrieved from five different datasets, as it was described in detail previously in the methods section. The samples from "ARISTOTLE" [dataset (i), n=23] and "TRIP" [dataset (ii), n=13] were merged since they were drawn from PWID participating in intervention programmes implemented in Athens, Greece, during the same time period (  (Figure 1). For most cases, the molecular clock inferred infection dates preceded the clinically estimated ones (89 of 145, 61.4%). Moreover, the fitted regression line was described by the equation: molecular clock inferred = -53.49 + 1.03 × clinically estimated, with 95% CI for the estimated intercept at (-138.49, 31.51) (p=0.216) and with 95% CI for the estimated slope at (0.98, 1.07) (p=0.0514) ( Figure 1). Lin's concordance correlation coefficient was 0.92 (95% CI: 0.90, 0.94) and showed an agreement between clinically and molecularly estimated infection dates (p<0.001).
The median difference between the clinically estimated and molecular clock inferred infection dates was 0.18 (IQR: -0.21, 0.89) years across all sequences (n=145) ( Table 2). The largest differences were found for the non-PWID infected with subtype A1 in Greece [dataset (iii), median value: 2.04 (IQR: 0.62, 2.99) years] and for the non-PWID of the CRF42_BF . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.11.20247601 doi: medRxiv preprint clade in Luxembourg [dataset (v), median value: 0.52 (IQR: -0.16, 1.28) years] (Table 2).
Generally, for the PWID of our study population the differences were very low [datasets (i) & (ii): median value: 0.09 (IQR: -0.04, 0.36) years; dataset (iv): median value: -0.24 (IQR: -0.46, 0.19) years) ( Table 2). The differences between the clinically and molecularly estimated infection dates for all PLHIV of our study population were plotted against the mean obtained by the two variables ( Figure 2). The Bland-Altman plot revealed an agreement between the two variables, an absence of a consistent bias and the existence of a small number of outliers [6 of 145 (4.1%) PLHIV with a difference greater than 4 years] ( Figure 2). Moreover, the difference between the clinically estimated and molecular clock inferred infection dates was not correlated with the time interval from the last seronegative to the first seropositive tests (p=0.822).
The median estimate of the time interval between the molecular clock inferred infection dates and the sampling dates of the sequences was 0.59 (IQR: 0.25, 1.45) years ( . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.11.20247601 doi: medRxiv preprint

Discussion
Knowledge of infection dates is valuable for estimating the number of new HIV infections per year, the improvement of linkage to care and also for tailoring efforts to control HIV transmission (24, 25). Our study provides evidence that, under specific conditions, HIV data generated by Sanger sequencing can be used for reliable estimation of HIV infection dates.
Specifically, the t MRCA of nodes within local MTCs can provide a proxy of the HIV infection dates. Our validation using sequences from PLHIV with previously known transmission dates (clinically estimated dates of infection) suggested that the difference between clinically estimated and molecular clock inferred infection dates was smaller for PLHIV infected within MTCs with short branches translated as short time intervals between sampling and infection.
On the other hand, this difference was not correlated with the time interval from the last seronegative to the first seropositive tests, suggesting that the discordance between the two infection dates was not due to a larger uncertainty of the clinically estimated dates of infection.
The molecular clock inferred age of infection preceded the clinically estimated one for most PLHIV due to the characteristics of HIV transmission. Specifically, the t MRCA of an internal node of viral sequences from two persons (i.e., the source and the recipient) corresponds to the coalescent time of the viral lineages in the source (8). Therefore, the molecular clock inferred dates of infection can precede the clinically estimated dates due to the existing genetic divergence at the time of infection (pre-transmission interval) (8,9). In our case, only a single date was estimated for both the source and the recipient and thus the accuracy of the infection ages depends also on the serial interval (time between two HIV cases in a transmission chain).
This probably explains why the accuracy is higher for PWID, since in this case a large number of transmissions occurred within a short time frame.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted December 11, 2020. ; https://doi.org/10. 1101/2020 To date, molecular markers, such as the low proportion of ambiguity sites in partial HIV-1 sequences, have been shown to provide a reliable marker of recent transmission (5-7).
Kouyos et al suggested that the proportion of ambiguity sites increased by 0.2% within the first 8 years and a proportion >0.5% is a marker of a non-recent infection (5). Earlier studies have shown that molecular clock analyses using multiple sequences per patient can reliably estimate HIV-1 infection dates (8,9). More recently, Puller et al showed that within patient viral diversity estimated using NGS data can be used for the accurate estimation of infection dates (11). The accuracy of the estimations was improved by increasing the sequence depth and particularly viral diversity in the 3 rd codon position increased linearly with time (11). NGS data are superior to the fraction of ambiguous sites or Sanger sequencing (11,12). Moreover, NGS does not require heterochronous sampling and this provides an additional strength of this method.
In the current study, we show that t MRCA estimations of the infection ages based on Sanger sequencing data can be used reliably as an approximation of HIV infection dates. Our approach has a few limitations, such as that it can be utilized only for sequences from PLHIV infected with MTCs. Our approach is probably less reliable than NGS, however NGS data or heterochronous sequences are largely missing from HIV clinical laboratories. Alternatively, and under specific conditions, widely available HIV-1 Sanger sequencing data can be used for the estimation of HIV infection dates. Finally, if MTCs have been characterized, our study suggests that dating can be used for an on-going epidemic to quickly assess whether a given blood sample represents a recent infection. This is a useful adjunct to LAg for HIV public health interventions in accordance with one of the key strategies to respond quickly to potential HIV outbreaks as part of the Ending HIV plan in the USA.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.11.20247601 doi: medRxiv preprint

Funding
The study has been supported in part by the following grants: i) "ARISTOTLE" program was

Competing financial interests
The authors declare no competing financial interests.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.11.20247601 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review) preprint
The copyright holder for this this version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.11.20247601 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review) preprint
The copyright holder for this this version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.11.20247601 doi: medRxiv preprint . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review) preprint
The copyright holder for this this version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.11.20247601 doi: medRxiv preprint . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review) preprint
The copyright holder for this this version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.11.20247601 doi: medRxiv preprint . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.11.20247601 doi: medRxiv preprint   C  i  r  c  u  l  a  t  i  n  g  R  e  c  o  m  b  i  n  a  n  t  F  o  r  m  4  2  (  C  R  F  4  2  _  B  F  )  i  n  L  u  x  e  m  b  o  u  r  g  .  A  I  D  S  R  e  s  H  u  m  R  e  t  r  o  v  i  r  u  s  e  s  .   2  0  1  5  ;  3  1  (  5  )  :  5  5  4  -8  .   1  9  .  K  o  s  t  a  k  i  E  ,  M  a  g  i  o  r  k  i  n  i  s  G  ,  P  s  i  c  h  o  g  i  o  u  M  ,  F  l  a  m  p  o  u  r  i  s  A  ,  I  l  i  o  p  o  u  l  o  s  P  ,  P  a  p  a  c  h  r  i  s  t  o  u  E  ,  e  t  a  l  .   D  e  t  a  i  l  e  d  M  o  l  e  c  u  l  a  r  S  u  r  v  e  i  l  l  a  n  c  e  o  f  t  h  e  H  I  V  -1  O  u  t  b  r  e  a  k  A  m  o  n  g  P  e  o  p  l  e  w  h  o  I  n  j  e  c  t  D  r  u  g  s  (  P  W  I  D  ) i n . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted December 11, 2020. ;https://doi.org/10.1101https://doi.org/10. /2020