DEVELOPMENT AND VALIDATION OF A MODEL FOR THE PREDICTION OF MORTALITY IN CHILDREN UNDER FIVE YEARS WITH CLINICAL PNEUMONIA IN RURAL GAMBIA

Background: Pneumonia is the leading cause of death in children aged 1-59 months. Prediction models for child pneumonia mortality have been developed using regression methods but their performance is insufficient for clinical use. Methods: We used a variety of machine learning methods to develop a predictive model for mortality in children with clinical pneumonia enrolled in population-based surveillance in the Basse Health and Demographic Surveillance System in rural Gambia (n=11,012). Four machine learning algorithms (support vector machine, random forest, artifical neural network, and regularized logistic regression) were implemented, fitting all possible combinations of two or more of 16 selected features. Models were shortlisted based on their training set performance , the number of included features, and the reliability of feature measurement. The final model was selected considering its clinical interpretability. Results: When we applied the final model to the test set (55 deaths), the area under the Receiver Operating Characteristic Curve was 0.88 (95% confidence interval: 0.84, 0.91), sensitivity was 0.78 and specificity was 0.77. Conclusions: Our evaluation of multiple machine learning methods combined with minimal and pragmatic feature selection led to a predictive model with very good performance. We plan further validation of our model in different populations.

. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint (3466 of max 3500) Pneumonia accounted for an estimated 921,000 childhood deaths in 2015, more than any other single condition. 1 The clinical status of children with pneumonia can rapidly deteriorate but existing mortality prediction tools have limited reliability. 2 The challenges of pneumonia diagnosis, assessment of severity, and the prediction of mortality are most acutely felt in low-resource settings, where expert personnel and well-equipped facilities are lacking.
In high-income settings, several clinical prediction scores of pneumonia severity have been developed and proven useful to triage patients for higher levels of clinical care, both in adults [3][4][5] and, less often in children. 6,7 Due to the different characteristics of the population and health care services, these scales are poorly suited to low-income settings, 6 where the ability to predict mortality in children with clinical pneumonia would be valuable both in primary care, to ensure appropriate referral for admission, and also in secondary and tertiary care, for referral to more intensive clinical support. In this context, a number of models to predict mortality in children with clinical pneumonia have been developed, providing moderately good levels of prediction. 2 Most of these models were developed using regression methods such as logistic regression (e.g. the RISC 8 and PERCH 2 scores. While a strength of regression models is their interpretability, superiror classification accuracy can be obtained from machine learning algorithms such as random forest (RF), support vector machines (SVM), and artificial neural networks (ANN). 9 Our goal in this study was to develop and validate a predictive machine learning model for pneumonia mortality in children using variables readily available to clinicians .

METHODS
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint

Study population
We used population-based surveillance data from the Pneumococcal Surveillance Project from June 4 th 2009 until December 31st 2017. 10  Surveillance System who met surveillance criteria for suspected pneumonia, sepsis or meningitis and were referred to clinicians for standardized review. In addition, our dataset included patients from January 1st 2011 until December 31st 2017 who did not meet surveillance criteria for suspected pneumonia, sepsis or meningitis or were not resident in the Basse Health & Demographic Surveillance System but had standardized clinician review,effectively including all children admitted at the Basse Health Centre. This health centre is a secondary care facility, and the only major health centre in Upper River Region, serving the largely rural population of approximately 260,000 on both banks of the River Gambia. 10 Inclusion criteria were age 2-59 months and having clinical pneumonia, defined as cough or difficulty breathing accompanied by one or more of: raised respiratory rate for age (≥50 breaths/minute if less than one year old; ≥40 breaths/minute if older than one year), lower chest wall indrawing, inability to sit or feed, convulsions, or oxygen saturation <92 percent. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint allowed us to maintain the chronology of the data. Performance was not sensitive to the training and testing temporal cut-off.

Outcome and predictors
Our outcome of interest was hospital based mortality. Potential explanatory variables in the predictive modelling process based on their clinical relevance are summarised in Table 1 Haemoglobin concentration was excluded due to missing data. Nasal flaring and grunting were not included due to concerns of substantial subjectivity and inter-observer disagreement. In addition, 'abnormal conscious state' was discarded for being a near zerovariance predictor. 11 All variables were collected at the time of admission. Impossible or implausible values were removed and only subjects with data for all features were included in the analyses (complete-case analyses), which represented over 94 percent of the original dataset (Table 1, Figure 1).
For this project, we asked a panel of three clinicians (one of the authors and Drs. Yekini-Ajao Olatunji and Galega Babila Lobga) to classify each feature based on reliability, the amount of skill required by the person measuring it, and consistency, with the agreed classification shown in Table 1.

Model development
Types of Model. There is a large variety of machine learning algorithms , but, a priori, no best choice. 9 Each has a distinct approach to the classification task: Random forests consist of multiple random decision trees, where each tree is built on a random sample from the original data and, at each tree node, a subset of features are randomly selected to generate the split that minimizes the probability that a randomly selected sample from a node will be . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint incorrectly classified. 12 Support Vector Machines aim to find a boundary that separates two groups in the data by maximising the sum of the distance between the boundary and the chosen support vectors. 13 Artificial Neural Networks consist of an input layer, an output layer and a set of hidden layers with a certain number of neurons. Neurons in each layer are connected by sets of weights. The algorithm's goal is to find the combination of weights that will minimize the amount of errors in the output layer. 14 Finally, we also used Regularized Logistic Regression (RLR), which uses maximum-likelihood estimation to find the coefficient values that minimize the error in the predicted probabilities. As part of the regression family, this method is most similar to the algorithm used during the development of previous scales, while allowing us to penalise increasing numbers of features (to limit overfitting) and apply different weights to the misclassification of cases and non-cases. 15 Feature selection during modelling. Since the number of potential predictors in our dataset was relatively low (16), we applied each of the selected machine learning algorithms to every possible combination of two or more features, resulting in a total of 65,519 feature combinations. We therefore did not perform further predictor selection. Given our exhaustive approach to the feature selection, we did not include interaction terms to avoid the risk of overfitting. 16 We explored the presence of non-linear relationships with fractional polynomials, but saw no consistent improvement in fit. Consequently, all continuous predictors were fitted as linear terms (centred and scaled by subtracting the mean and dividing by the standard deviation).
Sample size. Since this was a post-hoc analysis, no power calculation was possible. A rule of thumb to have at least 10 outcome events per model feature has been suggested by some is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint authors based on empirical research. [17][18][19] With over 200 events and a maximum of 16 parameters, our dataset meets this criterion.
Model generation and internal validation. Using the training set, we generated four models for each of the 65,519 feature combinations (one for each of the machine learning algorithms considered). Each model was developed using repeated 10-fold cross-validation (5 repetitions) with adaptive resampling to optimize hyper-parameter tuning(starting with a random set of 20 hyper-parameter combinations each time). 20,21 The data set is highly imbalanced (only 2 percent of patients died) and we took the following precautions to avoid bias towards the majority class (survivors): First, for each algorithm four different weighing schemes penalised misclassification of death (four times more, two times more, 1.5 times more, or without penalty). Second, we used the Synthetic Minority Over-Sampling Technique 22 to balance the number of events and non-events by up-sampling the deaths and under-sumpling the number of non-events (survivors). To avoid overfitting with minority over-sampling, we limited the number of up-sampling of deaths to five times the original number of cases, forcing the down-sampling of non-event cases to result in an equal number of events (after up-sampling) and non-events. Third, we used a threshold-invariant metric -the Area Under the receiver operating characteristic Curve (AUC)-to avoid depending on threshold values when assessing the models' performance on the training set ( Figure 1).
For each model, we also identified the best threshold to classify patients (in the training set) by identifying the point in the receiver operating characteristic curve closest to the top-left corner (perfect sensitivity and specificity). All models were generated using R software, and . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; the caret package. 23,24 Due to the high number of feature combinations, we ran our code in an embarassingly parallel format on a high performance computer.
Final model selection. In order to reduce the risk of over-fitting and to increase the generalisability of our final model we did not rely solely on the performance metrics of the generated models during training, but also on the number of features and the number of less-reliable features ( Table 1) included in the model. We therefore shortlisted, from the Final model validation. Once the final model was chosen we evaluated its performance in the held-out test set: we used the model to generate for each subject a prediction probability of mortality, using the best threshold from the training set to classify them as dead or alive, and compared these predictions with the actual outcomes. We generated measures of calibration (density and calibration plots), discrimination (AUC) and classification (sensitivity and specificity). Calibration evaluates the agreement between a model's prediction and the true outcome. Discrimination is the model's ability to . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint differentiate between survivors and deaths. Classification refers to the performance of the model after introducing a probability threshold. 16 We also generated the model's learning curve, which consists of two curves generated by assessing the performance (AUC in our case) of the model when applied to the training set and the test set, respectively, at increasing fractions of the training set. The shape and dynamics of a learning curve can be used to assess if the model was overfitting (lines getting closer but then separating again over increased fractions of the training set used) and how well it was generalising (distance between the two lines). 25

RESULTS
The initial dataset consisted of 11,698 subjects, of which 686 (5.9%) were excluded due to missing values ( Figure 1). The remaining 11,012 subjects were non-randomly split into a training set (7,341 children, 167 deaths) and a test set (3,671 children, 55 deaths) based on their date of admission. Table 1  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint for individual and batch mortality predictions, as well as validation of our model in new datasets. Users can input values manually or upload a dataset to obtain the predicted probability of death for each record, as well as a dichotomous classification based on the optimal threshold of 0.44 identified in our training set. Finally, if data on mortality are provided in the dataset, it also allows assessment of model performance in a new populaiton.

DISCUSSION
In this study we treated the prediction of pneumonia mortality in children as a classification problem. Logistic regression is a widely used methodolgy to model the relationship between a set of predictors and a binary outcome variable, but, in terms of classification, it is often outperformed by other machine learning algorithms such as RF, SVM and ANN. 9 These algorithms rely on less assumptions, but at the cost of interpretability and higher risk of overfitting. Given our sparse feature set (16) and our access to a high performance computer, we were able to skip the feature selection process and generate four models (one for each algorithm) for each combination of two or more features. We shortlisted the models that were both parsimonious and predictive, selecting our final model based on the most clincial interpretable features. This final model performed well in both the test and training sets. Overall, our final model has potential to be a useful predictive model for mortality in children under 5 years of age with clinical pneumonia.
As expected, RLR was outperformed by all the other algorithms. While RF and SVM generated more models with AUC ≥ 0.9 than NNET, once we shortlisted the number of models based on the number of features and their measurement reliability we had a similar number of candidate models from each algorithm. This confirms that there is no obvious . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint best algorithm for this type of data, and justifies an approach that initially considers a variety of methods.
There are limited scoring systems or classification methods to identify high-risk children/severity of pneumonia. A formal comparison of our model with these tools was beyond the scope of this paper, but in a recent publication Gallagher et al 2 compared the performance of several scores and classification methods and found that the best scoring system (using the number of danger signs from the 2013 World Health Organisation Pocketbook of Hospital Care for Children 26 ) obtained a high AUC (0.82), although the discrete nature of the score did not allow for a threshold that was both sensitive and specific.

Limitations
Our study has several limitations that could affect the generalisability of our results. First, we noticed that the presence of missing values was more prevalent among children who died than those who survived. If the presence of missing values was related to the severity of the condition the resulting sample after removing cases with missing values would have been healthier than the original one introducing a bias towards less accurate predictions.
Second, we included children in our dataset that met a clinical definition of pneumonia, which was in most cases not confirmed with X-ray, resulting in possible inclusion of children without 'real' pneumonia. We believe this can also be viewed as a strength, as it better reflects the reality in low or middle income countries and/or in rural areas.
Third, our study population is representative of a rural area in The Gambia, but there might be notable differences between this and other regions in sub-Saharan Africa due to access . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint to health care and comorbidities such as human immunodeficiency virus, malaria, tuberculosis, and sepsis. Our final model should be used with caution in other populations until its generalizability has been further validated. Alternatively, practitioners should use the methodology described in this paper to create their own local prediction models.
Fourth, during the development of our candidate models we were limited to 16 features that were consistently collected in the Pneumococcal Surveillance Project. Therefore, there could be other better predictors that we may have missed. For example, nasal flaring, grunting, head nodding and vomiting are some features considered in other pneumonia severity scales but that were not among our starting features.
Machine learning algorithms make predictions based on detecting patterns in training data.
A weakness of this approach is that it can lead to models with poor interpretability, resulting is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint

Strengths
The strengths of our study include its approach to pneumonia mortality in children as a classification problem and its design focused on the generalisability of the final model, not relying only on the performance of the candidate models on the training set, but also taking into account the number of features, their interpretability and reliability. Furthermore, we held out 1/3 of our dataset (with 55 deaths) to test the final model on unseen data.
We addressed the challenges of an imbalanced dataset (very few events) by using three complementary strategies: up and down-sampling (with limited number of up-sampled events), class weighting (penalising the misclassification of events), and using a thresholdinvariant performance metric (avoiding the impact of choosing a threshold on a model's performance during its development).
Despite holding out 1/3 of our dataset and the low incidence of our outcome of interest (only 2 percent deaths), our sample could still be considered large, with more than 10 cases per feature in the training set even for the candidate models with all possible features.
Furthermore, the final model, with 5 features, was developed with more than 30 cases per feature.
Another strength of our models was that they predicted the probability of mortality, allowing us to identify the best classification threshold without being limited by discrete values. Finally, since all measurements come from the same study the definitions and procedures are consistent throughout the data collection.

Implications and conclusion
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint Our final model showed excellent performance not only in the training set, but also in a previously unseen test set. It could be used to assist with triage decisions. Future research will compare our model with existing models and validate our model on different populations to assess its generalizability. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Supplementary Digital Content
The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint   is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint  • Repeated cross-validation (10 folds, 5 repetitions) • Adaptive resampling with 20 starting hyper-parameter combinations. • Pre-processing (centre and scaling) applied to each resampled version of the data Addressing class imbalance 1) Class weights (non-cases to cases) • 1/4 to 1; 1/2 to 1; 2/3 to 1; 1 to 1 2) SMOTE subsampling • Up-sample cases to number of cases × 5 • Down-sample non-cases to number of cases × 5 3) Threshold-invariant performance metric (AUC) . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 5, 2021. ; https://doi.org/10.1101/2021.08.04.21260737 doi: medRxiv preprint