Abstract
Malaria, predominantly caused by Plasmodium falciparum, poses one of the largest and most durable health threats in the world. Previously, simplistic regression-based models have been created to characterize malaria infections, though these models often include only a couple of genetic factors. Specifically, the Baker et al., 2005 model uses two types of particular repeats in histidine-rich protein 2 (PfHRP2) to assert P. falciparum infection [1], though the efficacy of this model has waned over recent years due to genetic mutations in the parasite.
In this work, we use a dataset of 406 P. falciparum PfHRP2 genetic sequences collected in Ethiopia and derive a larger set of motif repeat matches for use in generating a series of diagnostic machine learning models. Here we show that the usage of additional and different motif repeats proves effective in predicting infection. Furthermore, we use machine learning model explainability methods to highlight which of the repeat types are most important, thereby suggesting potential targets for future versions of rapid diagnostic tests.
1 Introduction
Malaria infected over 228 million people and resulted in 405,000 deaths in 2018 [2]. Genomics is beginning to bear fruit in the abatement of malaria but presents analytical challenges due to the complexity of the disease and its components (human, Plasmodium spp., and vector mosquitoes).
In most developing countries, the detection and diagnosis of malaria infections is often performed using simple rapid diagnostic tests (RDTs). Specifically, these tests are lateral flow immuno-chromatographic antigen detection tests that are similar in modality to common at-home pregnancy tests. These tests use dye-labeled antibodies to bind to a particular parasite antigen and display a line on a test strip if the antibodies bind to the antigen of interest [3]. If patients are properly diagnosed, P. falciparum infections can be treated using the drug artemisinin. Unfortunately, the efficacy of both RDTs and artemisinin treatment is waning. Our purpose is to use large datasets and machine learning methods to address the shortcomings in malaria diagnosis.
In 2005, Baker et al. published a simple linear regression-based model that purports to predict the detection sensitivity of RDTs using a small fraction of genetic sequence variants that code for histidine-rich protein 2 (PfHRP2) [1]. While the accuracy of the Baker model was high (87.5%) with the data available at the time, its ability to explain RDT sensitivity was low (R² = 0.353). Enthusiasm for the Baker model has since diminished. In 2010, Baker et al. published a report in which they concluded that they could no longer correlate sequence variation and RDT failure with their model [4].
Nevertheless, there is no alternative to the Baker model and it is still in use. In this study, our hypothesis is that a model for understanding the relationship between RDT results and sequence variation can be improved by using a larger set of genetic sequence variants. We analyze a collection of genetic data and metadata from 406 P. falciparum infections in Ethiopia with the Baker model along with a sweep of other machine learning models that we generate.
Beyond simply training a better model using more sophisticated algorithms, our research focus is to allow for interpretable insights of the machine learning models to be derived from the “black box”. We have shown previous success in AI-driven explanations of gene expression underlying drug resistant strains of Plasmodium falciparum [5, 6]. We apply this model interpretability here to identify which types of histidine-rich repeats in PfHRP2 are most indicative of malaria infection.
2 Materials and Methods
2.1 Data Collection
Blood samples and demographic data were collected from suspected malaria patients greater than five years of age in various health clinics during both the low and high transmission seasons in different regions of Assosa, Ethiopia. Microscopy and rapid diagnostic testing were performed within the health clinics, and drops of blood spotted on Whatman 3MM filter paper were kept in sealed pouches for later analyses. CareStart™ malaria combination RDTs (lot code 18H61 from Access Bio Ethiopia) were used to diagnose P. falciparum and to evaluate their performance against microscopy as a reference test.
The P. falciparum DNA concentration in dried blood spot samples was analyzed using real-time quantitative PCR (RT-PCR). The P. falciparum DNA was extracted using phosphate buffered saline, Saponin, and Chelex [7], and samples were confirmed as P. falciparum positive if their RT-PCR values were less than or equal to 37 [8]. The null hypothesis was that RDT results and the detection of P. falciparum by RT-PCR would be strongly correlated (i.e., positive RDT samples would be RT-PCR positive and negative RDT samples would be RT-PCR negative). However, early findings have shown incongruence between the RDT results and RT-PCR [9].
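As a minimal sketch of the classification rule above, the following Python snippet labels a sample as RT-PCR positive when its value is at most 37 and tabulates agreement with the RDT result. The record layout, field names, and example values are hypothetical and are not drawn from the study data; treating a missing RT-PCR value as negative is likewise an assumption.

```python
from collections import Counter

PCR_POSITIVE_MAX = 37.0  # threshold stated in the text: values <= 37 are positive


def pcr_positive(rt_pcr_value):
    """Return True when a sample counts as P. falciparum positive by RT-PCR."""
    # Assumption: a missing value (no amplification) is treated as negative.
    return rt_pcr_value is not None and rt_pcr_value <= PCR_POSITIVE_MAX


def concordance_table(samples):
    """Count (RDT result, RT-PCR result) pairs across a list of sample records."""
    return Counter((s["rdt_positive"], pcr_positive(s["rt_pcr_value"])) for s in samples)


# Hypothetical records, invented purely for illustration.
samples = [
    {"rdt_positive": True, "rt_pcr_value": 24.1},
    {"rdt_positive": True, "rt_pcr_value": 38.5},
    {"rdt_positive": False, "rt_pcr_value": 30.2},
    {"rdt_positive": False, "rt_pcr_value": None},
]
table = concordance_table(samples)
# Discordant samples are those where the two tests disagree.
discordant = table[(True, False)] + table[(False, True)]
print(dict(table), "discordant:", discordant)
```

The off-diagonal cells of such a table are the cases of interest here, since they correspond to the incongruence between RDT and RT-PCR results.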
Using the primers listed in Table 1, two amplicons were sequenced, including a 600 to 960-bp fragment for Pfhrp2 Exon 2 [1] and a 294 to 552-bp fragment for Pfhrp3 Exon 2 [4]. Polymerase Chain Reaction (PCR) conditions for Pfhrp2 Exon 2 and Pfhrp3 Exon 2 are shown in Table 1. The DNA amplicon quality was observed by means of agarose gel electrophoresis and the bands visualized in a UV transilluminator. PCR products were cleaned with 10 units of Exonuclease I (Thermo Scientific) and 0.5 units of shrimp alkaline phosphatase (Affymetrix) at 37 °C for 1 h followed by a 15 min incubation at 65 °C to deactivate the enzymes. PCR products were sequenced with ABI BigDye Terminator v3.1 (Thermo Fisher Scientific) following the manufacturer's protocol using the conditions of (1) 95 °C for 10 s, (2) 95 °C for 10 s, (3) 51 °C for 5 s, (4) 60 °C for 4 min, and (5) repeat steps 2-4 for 39 more cycles. The samples were cleaned using Sephadex G-50 (Sigma-Aldrich) medium in a filter plate and centrifuged in a vacufuge to decant.
The samples were reconstituted with Hi-Di Formamide (Thermo Fisher Scientific) and the plates were placed on the ABI 3130 Sequencer. Sequence trace files from all samples and repeat samples were imported into CodonCode Aligner (CodonCode Corporation). The bases were called for each sample. The ends of the sequences were trimmed by the application when possible and manually when necessary. All sequences were examined and evaluated on both the forward and reverse strands, with manual base corrections and manual base calls occurring when necessary.
2.2 Data Preparation
All Pfhrp2 exon 2 nucleotide sequences were exported from CodonCode Aligner (CodonCode Corporation) and individually pasted into the ExPASy Translate tool (Swiss Institute of Bioinformatics Resource Portal). Both forward and reverse DNA strands were translated using the standard NCBI genetic code. The six reading frames of the amino acid sequence produced were examined.
For each nucleotide sequence, the amino acid sequence presenting the fewest number of stop codons was selected for further analysis. If two or more of the reading frames appeared to produce sequences with an equally minimal number of stop codons, the reading frame that produced a sequence exhibiting the previously recognized pattern in prior sequences was selected for further analysis. While most of the sequences had a clear, single best translation, 11 of the sequences required further editing. In these 11 sequences, the sequence portion before or after the stop codon which exhibited a pattern similar to prior sequences was used in analysis, while the portion of the sequence preceding or following the stop codon, which did not exhibit the recognized pattern, was discarded. Nucleotide sequence input into the ExPASy Translate Tool (Swiss Institute of Bioinformatics Resource Portal) was repeated and verified for accuracy of amino acid sequences. The verified sequences were compiled.
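The frame-selection rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the ExPASy Translate tool used in the study: the function names are hypothetical, unknown codons are rendered as "X", and ties are returned as a set of candidates because the text resolves them by manual inspection of the repeat pattern.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; ambiguous codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse)):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


def fewest_stop_frames(dna):
    """Keep every frame that ties for the minimum number of stop codons ('*')."""
    frames = six_frames(dna)
    fewest = min(protein.count("*") for protein in frames.values())
    return {name: p for name, p in frames.items() if p.count("*") == fewest}


# Hypothetical 18-bp fragment encoding the peptide AHHAAD in frame +1.
example = "GCTCATCATGCTGCTGAT"
candidates = fewest_stop_frames(example)
print(candidates)
```

For short fragments several frames can tie with zero stop codons, which is why the sketch returns all candidates rather than choosing one automatically.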
2.2.1 Motif Search
A motif search was performed across 24 different types of histidine-based repeats. These repeat types, listed in Table 3, were originally defined by Baker et al. (2010) [4]. This search was completed using the motif.find() function in the bio3d package in R [11]. Specifically, each amino acid sequence was searched for each of the 24 repeat motifs and the count of matches was recorded in the dataset (see Table 2). The breakdown of match frequencies by location is shown in Figure 4.
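The counting step can be illustrated with a short Python sketch. It is a stand-in for, not a reproduction of, the bio3d motif.find() call used in the study. The two motifs shown are the sequences commonly reported for Types 2 and 7 and should be checked against Table 3; the function name and example sequence are hypothetical. Each motif is counted independently and without overlap, so a shorter motif nested inside a longer one is counted for both types, which is an assumption about how matches are tallied.

```python
import re

# Illustrative subset only; Table 3 is the authoritative list of all 24 repeat types.
REPEAT_MOTIFS = {
    "Type 2": "AHHAHHAAD",
    "Type 7": "AHHAAD",
}


def count_motifs(sequence, motifs=REPEAT_MOTIFS):
    """Count non-overlapping occurrences of each motif in one amino acid sequence."""
    sequence = sequence.upper()
    return {
        name: len(re.findall(re.escape(motif), sequence))
        for name, motif in motifs.items()
    }


# Hypothetical sequence: one Type 2 repeat followed by one Type 7 repeat.
counts = count_motifs("AHHAHHAADAHHAAD")
print(counts)
```

Because the Type 7 motif is a suffix of the Type 2 motif, the example yields two Type 7 matches for a sequence that visibly contains one of each; any downstream model would inherit whichever counting convention is chosen.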
2.3 Machine Learning
In this work, three machine learning experiments were created on different sets of features: 1.) using only the types that are in the original Baker model (Types 2 and 7), 2.) using all motif repeat type counts (Types 1 through 24), and 3.) using only the features found to be important in the experiment with all motif repeat types (Types 3, 5, and 10). Note that the PfHRP2 column in Table 2 is treated as the dependent variable in which “1” represents a positive case of malaria and “2” represents a negative case of malaria.
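A minimal sketch of how the three feature sets might be assembled is given below. The column names ("Type_1" through "Type_24"), the helper name, and the example row are hypothetical; only the type numbers and the 1/2 label coding come from the text. The label is recoded to 1 (positive) and 0 (negative) for convenience, which is a choice made here rather than a step described in the study.

```python
# Type numbers for the three experiments described in the text.
FEATURE_SETS = {
    "baker_types": [2, 7],
    "all_types": list(range(1, 25)),
    "important_types": [3, 5, 10],
}


def build_xy(rows, type_numbers):
    """Select motif-count columns and recode the PfHRP2 label (1 -> 1, 2 -> 0)."""
    features = [[row[f"Type_{n}"] for n in type_numbers] for row in rows]
    labels = [1 if row["PfHRP2"] == 1 else 0 for row in rows]
    return features, labels


# Hypothetical row with all counts set to zero except a few types.
row = {f"Type_{n}": 0 for n in range(1, 25)}
row.update({"Type_2": 4, "Type_7": 6, "Type_3": 1, "PfHRP2": 2})

X_baker, y = build_xy([row], FEATURE_SETS["baker_types"])
X_important, _ = build_xy([row], FEATURE_SETS["important_types"])
print(X_baker, X_important, y)
```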
We used the Microsoft Azure Machine Learning Service [12] as the tracking platform for retaining model performance metrics as the various models were generated. For this use case, multiple machine learning models were trained using various scaling techniques and algorithms. Scaling and normalization methods are shown in Table 5. We then created two ensemble models of the individual models using stack ensemble and voting ensemble methods.
The Microsoft AutoML package [13] allows for the parallel creation and testing of various models, fitting based on a primary metric. For this use case, models were trained using Decision Tree, Elastic Net, Extreme Random Tree, Gradient Boosting, Lasso Lars, LightGBM, RandomForest, and Stochastic Gradient Descent algorithms along with various scaling methods from Maximum Absolute Scaler, Min/Max Scaler, Principal Component Analysis, Robust Scaler, Sparse Normalizer, Standard Scale Wrapper, and Truncated Singular Value Decomposition Wrapper (as defined in Table 5). All of the machine learning algorithms are from the scikit-learn package [14] except for LightGBM, which is from the LightGBM package [15]. The settings for the model sweep are defined in Table 4.
For the experiment using only Types 2 and 7, 35 models were trained. For the experiment using Types 1 through 24, 35 models were trained. For the experiment using Types 3, 5, and 10, 31 models were trained.
Once the individual models were trained, two ensemble models (voting ensemble and stack ensemble) were then created and tested for each experiment. The voting ensemble method makes a prediction based on the weighted average of the previous models’ predicted classification outputs whereas the stacking ensemble method combines the previous models and trains a meta-model using the elastic net algorithm based on the output from the previous models. The model selection method used was the Caruana ensemble selection algorithm [17].
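To make the voting step concrete, the sketch below averages per-model predicted probabilities with per-model weights and thresholds the result. It is an illustration under assumptions rather than the AutoML implementation: the probabilities and weights are invented, the 0.5 cutoff is arbitrary, and in the study the ensemble members were chosen by the ensemble selection procedure rather than by hand.

```python
def weighted_vote(probabilities, weights):
    """Weighted average of predicted positive-class probabilities.

    probabilities: one list per model, each holding one probability per sample.
    weights: one non-negative weight per model.
    """
    total = sum(weights)
    n_samples = len(probabilities[0])
    return [
        sum(w * model[i] for w, model in zip(weights, probabilities)) / total
        for i in range(n_samples)
    ]


def classify(averaged, threshold=0.5):
    """Turn averaged probabilities into 0/1 class predictions."""
    return [1 if p >= threshold else 0 for p in averaged]


# Hypothetical outputs of three models on three samples.
model_probabilities = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.5],
    [0.8, 0.1, 0.7],
]
model_weights = [2, 1, 1]  # hypothetical weights

averaged = weighted_vote(model_probabilities, model_weights)
predictions = classify(averaged)
print(averaged, predictions)
```

A stacking ensemble differs in that the per-model outputs become input features to a separately trained meta-model instead of being averaged directly.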
3 Results
Metrics from the three experiments' machine learning models (one each for the best ensemble model and the best singular model) are reported in Table 6. The precision-recall curves for these models are shown in Table 8 and the receiver operating characteristic (ROC) curves are shown in Table 7. The ideal scenario is shown as a dash-dot-dash (-·-) line. The best model overall is the Extreme Random Trees model using only Types 3, 5, and 10. This was determined by looking at the overall model metrics and the generated curves. Note that many models were generated for each experiment, some of which had equal overall performance. All model runs can be found in the Supplementary Data.
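For readers less familiar with these curves, the following sketch shows how the area under the ROC curve and a single precision-recall point can be computed from labels and predicted scores. The labels, scores, and threshold are invented for illustration and do not correspond to any model in Table 6.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    # Each positive/negative pair contributes 1 if ranked correctly and 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def precision_recall(labels, scores, threshold):
    """Precision and recall when scores at or above the threshold are called positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels (1 = positive) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

auc = roc_auc(labels, scores)
precision, recall = precision_recall(labels, scores, threshold=0.65)
print(auc, precision, recall)
```

Sweeping the threshold over all observed scores and plotting the resulting pairs traces out the full precision-recall curve.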
Feature importance
Feature importances were calculated using mimic-based model explanation of the voting ensemble model for Types 1 through 24. The mimic explainer works by training global surrogate models to mimic a black box model [18]. The surrogate model is an interpretable model, trained to approximate the predictions of a black box model as accurately as possible [19].
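The idea of a global surrogate can be sketched with a deliberately simple interpretable model. The code below fits a one-feature threshold rule (a decision stump) to the predictions of a stand-in black box and reports, per feature, how faithfully the stump reproduces those predictions. This is a simplified illustration and not the mimic explainer used in the study: the black box, the data, and the use of stump fidelity as a rough importance signal are all assumptions made here.

```python
def fit_stump(X, targets, feature):
    """Find the threshold rule on one feature that best reproduces the targets.

    Returns (fidelity, threshold, polarity), where polarity is the class
    predicted when the feature value is at or above the threshold.
    """
    best = (0.0, None, None)
    for threshold in sorted({row[feature] for row in X}):
        for polarity in (1, 0):
            preds = [polarity if row[feature] >= threshold else 1 - polarity for row in X]
            fidelity = sum(p == t for p, t in zip(preds, targets)) / len(targets)
            if fidelity > best[0]:
                best = (fidelity, threshold, polarity)
    return best


def surrogate_report(black_box, X):
    """Fit one stump per feature to the black box's outputs (not the true labels)."""
    targets = [black_box(row) for row in X]
    return {feature: fit_stump(X, targets, feature) for feature in range(len(X[0]))}


def black_box(row):
    """Hypothetical opaque model: positive when features 0 and 2 sum to at least 2."""
    return 1 if row[0] + row[2] >= 2 else 0


# Hypothetical motif-count rows with three features.
X = [[0, 1, 0], [1, 2, 0], [2, 1, 0], [3, 2, 1], [0, 1, 3], [1, 2, 0]]
report = surrogate_report(black_box, X)
print(report)
```

In this toy example the stumps on features 0 and 2 track the black box more closely than the stump on feature 1, which mirrors the way a surrogate can point to the inputs a model actually relies on.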
In the Voting Ensemble model using Types 1 through 24, Types 3, 5, and 10 were found to have non-zero importance. See Figure 3.
3.1 Repeat Type Prevalence
As shown in Figure 4 and Table 10, many of the repeat types described by Baker et al., 2010 [4] (Table 3) are represented in the sequences analyzed in this study. Specifically, Types 1-10, 12-14, and 19 were found among these isolates. This is in general agreement with a similar report by Willie et al., 2018 [20] using samples collected from Papua New Guinea. They report that Types 1, 2, 6, 7, and 12 were present in almost all (≥ 89%) sequences, Types 3, 5, 8, and 10 were present in most (≥ 56%) sequences, and Types 4, 13, and 19 were seen in ≤ 33% of sequences. In contrast, we see a higher prevalence of Types 4 and 19 and a lower prevalence of Type 12 than in the previous study.
Another study by Bharti et al., 2016 [21], which used samples collected from multiple sites in India, reported that Types 1, 2, 7, and 12 were seen in 100% of their sequences. However, in our sequences from Ethiopia, we see multiple examples where these repeats are not present, especially Type 12.
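Prevalence in the sense used above, the share of sequences containing a given repeat type at least once, can be computed from per-sequence motif counts as in the following sketch. The row layout, type labels, and counts are hypothetical and are not taken from Table 10.

```python
def prevalence(count_rows, type_names):
    """Fraction of sequences in which each repeat type occurs at least once."""
    n_sequences = len(count_rows)
    return {
        name: sum(1 for row in count_rows if row.get(name, 0) > 0) / n_sequences
        for name in type_names
    }


# Hypothetical per-sequence motif counts for four sequences.
count_rows = [
    {"Type 2": 3, "Type 7": 5, "Type 12": 1},
    {"Type 2": 2, "Type 7": 4, "Type 12": 0},
    {"Type 2": 4, "Type 7": 0, "Type 12": 0},
    {"Type 2": 1, "Type 7": 6, "Type 12": 0},
]
shares = prevalence(count_rows, ["Type 2", "Type 7", "Type 12"])
print(shares)
```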
4 Discussion
Here we show the utility of machine learning in the identification of important factors in malaria diagnosis. Previous modeling by Baker et al., 2005 [1] had shown that the parasitic infection can be diagnosed by looking at the prevalence of particular types of amino acid repeats. The original regression-based model is no longer valid and, in this study, we show that the modeling of Types 2 and 7 using more sophisticated machine learning algorithms fails to produce a reliable model. However, the usage of Types 1 through 24 produces effective models to characterize malaria infections in our dataset. Furthermore, the usage of machine learning model explainability helps to pinpoint particular features of interest. In this case, Types 3, 5, and 10 reveal better diagnostic sensitivity for these malaria isolates collected from regions of Ethiopia.
This work posits the idea that RDTs can be revised to accommodate the genetic differences seen in today's malaria infections. Future versions of RDTs may be improved by targeting the variants identified in this work to increase sensitivity. While more work is to be done to empirically validate these findings, this in silico simulation may direct where to take experimental testing next. Furthermore, training machine learning models on sets of malaria sequences from other areas such as Papua New Guinea, India, or other areas of Africa may reveal that different repeats are important in those areas, likely suggesting the RDTs may need to be region-specific due to variations in P. falciparum across the globe.
Data Availability
All data, scripts, and model outputs are hosted on GitHub at: https://github.com/colbyford/pfHRP_MLModel
5 Supplementary Materials
All data, scripts, and model outputs are hosted on GitHub at: github.com/colbyford/pfHRP_MLModel
6 Author Contributions
G.A. and L.G. designed and performed the patient recruitment and sampling. G.A., L.G., and D.J. managed ethical approval, funding, and visas. G.A., K.L., K.B., and C.C.D. performed the DNA extractions, RT-PCR, PCR, and sequencing of the samples under the direction of L.G., D.J., and E.L. K.B. and D.J. performed the DNA to amino acid translations. C.T.F. performed the motif search for repeat types and performed all the machine learning and model interpretability work. All authors reviewed this manuscript.
8 Competing Interests
The authors declare that the research was conducted in the absence of any commercial, financial, or non-financial competing interests.
9 Ethics Statement
Scientific and ethical clearance was obtained from the Institutional Scientific and Ethical Review Boards of Addis Ababa University in Ethiopia and The University of North Carolina, Charlotte, USA. Written informed consent/assent for study participation was obtained from all consenting heads of households, parents/guardians (for minors under the age of 18), and each individual who was willing to participate in the study.
7 Acknowledgements
We thank the family of Carol Grotnes Belk for financial support. We acknowledge the administrative, salary, and laboratory support of these entities at the University of North Carolina at Charlotte: the Office of International Student Scholars, the Bioinformatics Research Center, the College of Computing and Informatics, the Department of Bioinformatics and Genomics, the College of Liberal Arts and Sciences, and the Department of Biological Sciences. The field data collection portion of this work was funded in part by Addis Ababa University Thematic Research.