Analytic optimization of Plasmodium falciparum marker gene haplotype recovery from amplicon deep sequencing of complex mixtures

Molecular epidemiologic studies of malaria parasites commonly employ amplicon deep sequencing (AmpSeq) of marker genes derived from dried blood spots (DBS) to answer public health questions related to topics such as transmission and drug resistance. As these methods are increasingly employed to inform direct public health action, it is important to rigorously evaluate the risk of false positive and false negative haplotypes derived from clinically-relevant sample types. We performed a control experiment evaluating haplotype recovery from AmpSeq of 5 marker genes (ama1, csp, msp7, sera2, and trap) from DBS containing mixtures of DNA from 1 to 10 known P. falciparum reference strains across 3 parasite densities in triplicate (n=270 samples). While false positive haplotypes were present across all parasite densities and mixtures, we optimized censoring criteria to remove 83% (148/179) of false positives while removing only 8% (67/859) of true positives. Post-censoring, the median pairwise Jaccard distance between replicates was 0.83. We failed to recover 35% (477/1365) of haplotypes expected to be present in the sample. Haplotypes were more likely to be missed in low-density samples with <1.5 genomes/μL (OR: 3.88, CI: 1.82–8.27, vs. high-density samples with ≥75 genomes/μL) and in samples with lower read depth (OR per 10,000 reads: 0.61, CI: 0.54–0.69). Furthermore, minority haplotypes within a sample were more likely to be missed than dominant haplotypes (OR per 0.01 increase in proportion: 0.96, CI: 0.96–0.97). Finally, in clinical samples the percent concordance across markers for multiplicity of infection ranged from 40%–80%. Taken together, our observations indicate that, with sufficient read depth, haplotypes can be successfully recovered from DBS while limiting the false positive rate.


Introduction
Malaria parasite surveillance and molecular epidemiologic studies increasingly employ as a genotyping approach amplicon deep sequencing (AmpSeq) of short polymorphic fragments of parasite DNA.Depending on which segments of the genome are sequenced, this approach returns haplotypes that can be used to estimate complexity of infection (1), investigate transmission between hosts (2,3), evaluate the prevalence and incidence of markers of drug resistance (4)(5)(6), and classify recurrent infections following drug treatment as reinfections or recrudescences (7,8).As a result of these diverse use cases of AmpSeq, there is a broad need for practical and empirically-derived approaches to maximize haplotype recovery and mitigate the risks of false genotypes.
Prior groups have evaluated the accuracy of haplotype recovery from mixtures containing DNA from known P. falciparum strains across a range of available tools and parameters, and reported that strains present in low proportions are likely to be missed (2,9) and that false positive haplotypes often have lower read depth (10,11).In a large analysis of complex mixtures of up to five reference strains, recovery of two markers was compared using four haplotype calling tools (11).They found that fewer haplotypes are recovered from samples with less P. falciparum DNA, that haplotypes with a lower read count were more frequently false positives, and that the four different haplotype calling tools performed similarly.What remains unexplored by these prior reports are investigations of haplotype recovery from samples with three key features of field studies: i) prepared and processed as dried blood spots (DBS), ii) present across a range of densities reflective of infections that are typically observed in field studies, and iii) harboring genomes from a large range of P. falciparum strains, which in natural infections can exceed 15 (2).
. We evaluated the accuracy of the recovery of diverse P. falciparum haplotypes from DBS harboring simple and complex mixtures of parasite genomes.To do so, we prepared mixtures of up to 10 parasite strains at known proportions and across three parasite density categories, and amplified and sequenced each in triplicate with MiSeq across polymorphic segments of five distinct markers (ama1, csp, msp7, sera2, and trap).With these reads, we investigated the influence of parasite density, genomic complexity, and haplotype censoring criteria on the removal of false positive haplotypes, the sensitivity and precision of haplotype discovery, interreplicate variability, and the ability to recover expected haplotypes at each locus. .
. Five mock polygenomic infections and a 3D7-only mock infection were created by making control mixtures that combined 1 ng/µl gDNA stocks of the distinct parasite reference strains in known percentages ranging from 1% to 100% (Figure 1A).Each control mixture was serially diluted in uninfected whole blood, and dried blood spots (DBSs) were made for each of the 11 dilutions per mixture.DBS were singly punched into individual wells of a deep 96-well plate, and a modified Chelex-100 protocol (3) was used to make gDNA extracts.These were then tested in duplicate with a duplex pfr364/human β-tubulin quantitative PCR (qPCR) assay that estimated parasite densities using a standard curve generated with extracts from control DBS at dilutions of P. falciparum 3D7 ranging from 0.1 to 2000 parasites/µL of whole blood (14).Control mixture extracts were assigned to one of three parasite density ranges (low, <1.5 genomes/µl; medium, 1.5-75 genomes/µl; and high, ≥75 genomes/µl) and pooled by mixture at each density range for a total 18 pools (6 mixtures x 3 densities) to be used as templates for subsequent PCR amplification.

Library preparation and sequencing
Each mixture template was prepared for sequencing according to qPCR Ct-value as described in (3).Then, from each mixture template, we amplified at the target segments of ama1, csp, msp7, sera2, and trap in individual reactions in triplicate using a nested PCR strategy.Library preparation for sequencing followed described methods (15) with the following exceptions: PCR1 reactions included 300nM of each primer and

Haplotype recovery
We used Snakemake v 7.20.0(16) to build an integrated pipeline for haplotype recovery, BRAVA (Basic and Rigorous Amplicon Variant Analyzer; https://github.com/duke-malariacollaboratory/BRAVA) in order to trim, filter, and map reads, and thence call haplotypes.Primers and adapters of amplicon deep sequencing reads for each marker were removed using Cutadapt v4.1 (17).These reads were trimmed using Trimmomatic v0.38 (18); this removed the leading and trailing bases below a Phred quality score of 10, removed all nucleotides from the 3' end after the quality of the read falls below an average Phred quality score of 15 over a sliding window of 4 nucleotides, and dropped reads with fewer than 80 nucleotides.Remaining reads were mapped to the 3D7 reference genome using BBmap v39.01 (19).Reads were then further filtered and trimmed using the R package DADA2 v1.20.0 (20) function filterAndTrim with a maximum number of expected errors (maxEE) equal to 1. Values ranging from 2 to 10 were tested for the truncQ parameter in filterAndTrim, which truncates reads at the first instance of a quality score ≤truncQ.The optimal value was determined to be the value that maximized the number of reads used for haplotype calling (10); the haplotypes that were output when using this value of truncQ were used for all subsequent analyses.Next, the learnErrors function was used to learn error rates, the dada function was used to remove sequencing errors and identify haplotypes, and the removeBimeraDenovo function was used to remove chimeras.All haplotypes returned by DADA2 were included for analysis.

Categorization of haplotypes
For each locus in each sample, we categorized each haplotype returned by DADA2 into one of three groups: . 1. Expected haplotype: A haplotype with an identical sequence to that of a template sequence expected to be observed in the sequenced library.These were considered true positive haplotypes.
2. Haplotype arising from systematic error: A haplotype with a sequence or read depth that we did not expect to observe in the sequenced library, but which was observed across all three replicates for at least one density bin.These were suspected to be truly present owing to either inadvertent introduction to mixtures during gDNA preparation or the presence of multiple haplotypes in the original source gDNA.Haplotypes arising from systematic error were removed from the analysis prior to screening for optimal thresholds for haplotype censoring, as we suspected that these template strains were truly present in the library that was sequenced and therefore shouldn't be expected to be corrected by applying filtering criteria.
3. Haplotype arising from random error: A haplotype that we did not expect to observe in the sequenced library, and that was not consistently present across replicates for any mixturedensity combination.These were considered false positive haplotypes.

Identification of optimal thresholds for haplotype censoring
We evaluated the efficacy of four common metrics used to censor haplotypes: i) the depth of reads within a sample supporting a haplotype (read depth), ii) the proportion of reads within a sample supporting a haplotype (read proportion), iii) the ratio of abundances of pairs of haplotypes within a sample with a Hamming distance of one (read ratio), and iv) the length difference of the returned haplotype relative to that of the expected reference strain (length difference).As mentioned above, haplotypes arising from systematic error were removed prior to evaluating these criteria.All reference strain haplotypes for all loci were identical in length to the 3D7 haplotype, except one msp7 haplotype that was 3 base pairs shorter.Thus, we defined this censoring criterion as follows: the difference in length between the observed haplotype and .
the 3D7 reference haplotype must be equal to 0, -3, or 3 (i.e. one codon may be inserted or deleted).For the other 3 censoring criteria, we used Youden's J statistic to identify optimal thresholds across all possible thresholds of the criterion and corresponding confidence intervals with the coords and ci.coords functions from the R package pROC v1.18.0 (21).Because the importance of retaining true positive haplotypes vs. removing false positive haplotypes varies depending on the use case, this statistic was computed using three different ways to weight false negative vs. false positive classifications: equal weight to false negatives and false positives, 2x the weight to false negatives, and 2x the weight to false positives.To evaluate censoring criteria, we used the optimal criteria based on false negatives having 2x the cost of false positives.

Risk factor analysis for missing haplotypes
We performed a bivariate and multivariate logistic regression to investigate risk factors for haplotype missingness in R using the glmer function in lme4 v1.1.32(22).Missing haplotypes were defined as those that were not observed in the sample prior to the application of any haplotype censoring criteria.The outcome was the presence or absence of the haplotype in the un-censored haplotypes, and covariates were target, starting proportion of the reference template strain, read depth (per 10,000 reads), parasite density, and expected number of distinct haplotypes present in the sample.A random intercept was included for each mixturedensity combination.Low-density mix C samples were excluded from this analysis as they exhibited signatures of contamination from a high-density sample.

Clinical sample analysis
Ten P. falciparum-positive DBS collected in a field study in Webuye, Kenya that were previously sequenced at the ama1 and csp loci (2) were sequenced at the msp7, sera2, and trap loci.
These samples were selected from those that were high-density and had MOIs >1 at both ama1 and csp loci (using previously defined haplotype calls and censoring criteria (2)).Haplotypes for newly sequenced loci were called with the pipeline described above, using the same method as for ama1 and csp.All haplotypes were censored using the identified optimal censoring criteria.

Ethical statement
The field study in which the clinical samples were collected was approved by institutional review boards of Moi University (2017/36) and Duke University (Pro00082000).All participants or guardians provided written informed consent, and those over age 8 years provided additional assent.

Data analysis and visualization
All data were analyzed and visualized using R v4. .

Mixtures, reference strains, deep sequencing, and haplotype calling
We sequenced five previously-developed AmpSeq marker genes: ama1, csp, msp7, sera2, and trap (Table 2), and generated for sequencing 6 mock infections harboring mixtures of gDNA from between 1 and 10 distinct parasite reference strains (Figure 1A) to approximate the polygenomic nature of many infections in high-transmission areas.Not all marker genes were unique to a strain; a total of 37 distinct haplotypes were present across the 10 strains and 5 markers.Pairwise single nucleotide variant (SNV) distance varied between strains and markers (median: 4, range: 0-15; Figure 1B).
For each marker gene, each of the 6 mixtures was sequenced from dilution pools corresponding to low (<1.5 genomes/µL), medium (1.5-75 genomes/µL) and high (≥75 genomes/µL) parasite density bins in triplicate, tallying to 1365 expected haplotype occurrences across 270 sequenced samples.We obtained analyzable reads for 257/270 samples, with differences in the absolute yield of read counts between low (4.3 million), medium (8.0 million), and high (10.2 million) density samples.This general observation held for each individual marker, save for trap and msp7 which returned moderate read amounts irrespective of parasite density bin (Figure 1C).Overall, we observed across the five loci and 257 samples 1292 haplotype occurrences (Figure 1D), for which the median read depth was 1542.other markers so read depth cannot be directly compared between ama1 and other markers.

False positive haplotypes
We first investigated false positive haplotype occurrences across samples.Within each sample, we categorized each observed haplotype as expected to be present in the sample (true positive, n=859/1292, 66%), likely cryptically present in the original mixture (systematic error; n=254/1292, 20%), or likely arising from random error (false positive, n=179/1292, 14%) (Figure 2A).Only 1% of reads that passed filtering supported haplotypes that were categorized as false positives.We observed this trend of proportionately few reads supporting proportionately more false positive haplotypes across both markers and parasite density bins (Figure 2B).
Furthermore, the percentage of false positive haplotypes was relatively similar across parasite density bin (12-16%), although for ama1 and sera2, there were fewer false positive haplotypes for low-density templates (Figure 2C).False positive haplotypes were often not the correct sequence length, were often only one nucleotide different from a reference sequence in the sample (Figure 2C), and had lower read depths than haplotypes we expected to observe (median=104 vs. 2393, Wilcox p < 0.001; Figure 2D). .

Evaluating haplotype censoring criteria
We next evaluated, in our dataset, the effectiveness of four important threshold criteria typically applied to remove false positives from AmpSeq data: read depth, read proportion, read ratio of .similar haplotypes, and haplotype length.The optimal thresholds had large confidence intervals and varied depending on how much weight was given to false positive vs. false negative haplotypes (Figure 3A-C 3D-E).Of the 179 random error haplotypes, 75% fell under the read threshold, 54% fell under the proportion threshold, 30% fell under the within-sample ratio threshold, and 28% had a length different than the reference strains.Furthermore, for all markers but trap, fewer false positive haplotypes were successfully censored in lower parasite density bins (Fisher's exact p < 0.01, Figure 3F), yielding more false positives post-censoring in low-( 11) compared to medium-( 6) and high-density (0) parasite bins.Of the censored true positive haplotypes, over half (39/67; 58%) were from high-density templates, and only 5/67 (7%) made up ≥10% of the original mixture (Figure 3G). .

Inter-replicate variability
To evaluate the consistency with which haplotypes were returned, we measured inter-replicate variability post-censoring.Overall, 58% of haplotypes were observed in all 3 replicates, 18% in 2 replicates, and 24% in 1 replicate.Haplotypes were more consistently returned in all three replicates for high-density samples (76% of the time) compared to medium-(61% of the time) and low-density samples (30% of the time) (Figure 4A).Consistent with this, in high-density samples Jaccard distances between replicates were higher (median = 1, IQR = 0.2) compared to medium-(median = 0.83, IQR = 0.5) and low-density samples (median = 0.5, IQR = 0.75) (Figure 4B).

Missing haplotypes
Of the 1365 haplotype occurrences expected to be present across all samples, we did not recover 477 (35%).Thus, we next investigated factors associated with missing haplotypes.As expected, haplotype proportion within a sample was inversely associated with missingness, with each increase of 0.01 in proportion associated with a 4% reduction in the likelihood of being missed (OR: 0.96, 95% CI: 0.96-0.97),even when controlling for marker, density bin, number of reads in the sample, and expected number of haplotypes (Figure 5A; Table 3).Additionally, for all markers except trap, <15% of haplotypes were missed from high-density samples, while >45% were missed from low-density samples (Fig 5A).Overall read depth for a sample was negatively correlated with the proportion of haplotypes that were missing from the sample .(Spearman's rho = -0.62; Figure 5B).Furthermore, within a sample, observed and expected read proportions were correlated, although there was high stochasticity, particularly for the lowdensity samples (Figure 5C).Finally, in high-density samples only 30/166 (18%) haplotypes were not recovered in any replicates, while in low-density samples 78/158 (49%) were not recovered in any replicates (Figure 5D).for low-density samples vs. -1 for medium-and high-samples, Wilcox p < 0.001; Figure 6A).
We performed a similar comparison using 10 high-density P. falciparum infections collected as DBS through a recent field study in Western Kenya, in order to capture a broader naturallyoccurring diversity of marker haplotypes (2).Using the optimal censoring criteria defined above, we observed 142 haplotypes across all samples and markers, of which 36 (25%) were censored.The range of MOIs was 1-5 for each marker.No marker consistently estimated the highest or lowest MOI, and the percent concordance ranged from 40% to 80% (Figure 6B).
. Percent concordance was computed for each sample as the percentage of markers for which the estimated MOI was equal to the mode MOI. .

Discussion
AmpSeq is an increasingly popular tool for molecular epidemiologic studies of P. falciparum collected on DBS, which necessitates rigorous haplotype recovery from field samples.We prepared DBS containing mixtures of gDNA from reference P. falciparum strains, amplified and sequenced polymorphic segments of 5 common marker genes in triplicate, and quantified the performance of haplotype recovery using a range of metrics.We observed that high sample read depth was associated with enhanced recovery of most haplotypes present in the original sample, and that censoring criteria based on read depth, read proportion, read ratio, and haplotype length can effectively remove most false positive haplotypes while retaining most true positive haplotypes.Thus, for use-cases which involve high-density samples or samples sequenced at high read depth, rigorous recovery can be achieved for multiple markers.
Consistent with prior studies (2,9,11), we observed that the likelihood of haplotype recovery is enhanced by higher parasite density and by a larger proportion of an individual haplotype within a mixture.In particular, the consistency with which we observed haplotypes across replicates was higher in high-density samples compared to low-density samples.However, we further observed that, independent of parasite density and reference haplotype proportion, successful haplotype recovery was further associated with a higher overall sample read depth.The ability to recover haplotypes constituting a minority population within a parasitemia with an overall low density is an important goal for many use cases of AmpSeq.Namely, therapeutic efficacy studies of antimalarials use active case detection to screen for recurrence of parasites, and frequently capture low-density infections with multiple strains which must then be compared to those in the initial infection in order to distinguish reinfection from recrudescent infection (7).
Additionally, studies of transmission networks in highly endemic settings in which low-density, asymptomatic infections predominate also benefit from comprehensive profiling of strains within mixtures in order to ascertain parasite relatedness between hosts (2).In these and similar use .
cases, the likelihood of detecting minority haplotypes can be improved by maximizing persample read depth, such as by limiting multiplexing and selecting maximal sequencing platform output.
We observed very different optimal censoring thresholds depending on how we weighted the relative importance of false positive and false negative haplotypes, which highlights the need to select censoring criteria suitable for the primary study objective.Penalizing false negative haplotypes more than false positive haplotypes yielded haplotype censoring criteria that still managed to remove most false positive haplotypes while retaining high sensitivity.Furthermore, these criteria were consistent with thresholds that others have used and reported in the literature (read depth: 204-420, read proportion: 0.005-0.014,read ratio: 0.09-0.36)(2,11).
We observed inconsistency in performance between markers with respect to false positives, censoring, missingness, and MOI.Pre-censoring, false-positive haplotypes were rarely recovered for msp7 but common for ama1 and trap.However, post-censoring the number of false positives was relatively low for all markers but trap.Fewer haplotypes were recovered for sera2 and trap overall.Furthermore, there was no consistent trend across a limited set of clinical samples of marker-specific MOI, suggesting that MOI estimates based upon a single marker may frequently underestimate the true MOI of a sample, as previously described (11).
Since most markers returned largely correct haplotype calls across a range of mixtures and parasite density bins, choice of marker may depend not only on marker performance but also other factors such as the biological question of interest (e.g.transmission, vaccine development, etc.).
Despite controlled laboratory conditions, we observed signatures of both systematic and random error.Systematic error may have resulted from two different sources.First, it is possible .that multiple haplotypes were present in the original template strains and were missed during Sanger sequencing.Second, systematic error could arise from contamination during gDNA extraction.Owing to the high-throughput manner of DBS processing, using 96-well plates, it is unfortunate but expected that we observe contamination in a small minority of samples included in a sequencing run.This highlights the importance of meticulous laboratory work and thoughtful controls, particularly because these haplotypes are less likely to be removed by censoring criteria owing to their presence in the original template.In contrast, random error may arise due to PCR stochasticity and polymerase error in low-input next-generation sequencing libraries (39).This is also inevitable, and the censoring criteria described here successfully removed many haplotypes arising from these technical errors.
Our study had several limitations.First, we created the mixtures from gDNA rather than from intracellular DNA; therefore, the composition of the solution from which DNA was amplified was slightly less complex than that from clinical samples.However, as we extracted DNA from DBS, our results provide a closer approximation to clinical samples than previous studies.Second, we did not attempt to censor haplotypes arising from systematic error because the commonly used censoring criteria assessed here assume that false positive haplotypes arise from random rather than systematic error.Third, this study focused on in silico recovery of haplotypes, and replicates were drawn from the same gDNA extract pools.Thus, variability occurring due to extraction is not accounted for in these data.However, our results provide useful insight into variation and random errors occurring at the amplification and sequencing steps.

Conclusions
We observed that P. falciparum haplotypes from multiple different targets can be successfully recovered from DBS, that in the majority of cases these haplotypes are recovered across .

Figure 1 :
Figure 1: Overview of mixtures, reference strains, and sequence yield.(A) Overview of mixtures A through F, each composed of various proportions of gDNA from the listed P.

Figure 2 :
Figure 2: Overview of false positive haplotypes.(A) Sample-level haplotype overview.Stacked boxes in each column represent observed haplotypes from reads that passed filtering,

Figure 3 :
Figure 3: Optimization and application of censoring criteria.(A-C) Sensitivity and specificity

Figure 4 :
Figure 4: Inter-replicate variability.(A) Number of replicates in which each haplotype was

Figure 5 :
Figure 5: Summary of missing haplotypes.(A) Numbers of missing haplotypes (light grey), observed but censored haplotypes (medium grey), and observed haplotypes (dark grey) in

Figure 6 :
Figure 6: Estimated multiplicities of infection (MOIs) based on each marker haplotype.(A) Observed minus expected MOIs for mixtures post-censoring.(B) Estimated MOIs in clinical

Table 2 : Marker gene characteristics
). Prioritizing the inclusion of true positive haplotypes over the removal

Table 3 : Risk factors for haplotype missingness
We next compared the expected multiplicity of infection (MOI) to the observed MOI after censoring, with MOI expressed as the number of haplotypes observed at each individual marker.Relative to the expected MOIs, the observed MOIs were equal 29% (74/254) of the time, lower 61% (154/254) of the time, and higher only 10% (26/254) of the time.MOIs were more likely to be underestimated in low-density samples (median observed-expected MOI = -4 .