CoAI: Cost-Aware Artificial Intelligence for Health Care

The recent emergence of accurate artificial intelligence (AI) models for disease diagnosis raises the possibility that AI-based clinical decision support could substantially lower the workload of healthcare providers. However, for this to occur, the input data to an AI predictive model, i.e., the patient's features, must themselves be low-cost, that is, efficient, inexpensive, or low-effort to acquire. When time or financial resources for gathering data are limited, as in emergency or critical care medicine, modern high-accuracy AI models that use thousands of patient features are likely impractical. To address this problem, we developed the CoAI (Cost-aware AI) framework to enable any kind of AI predictive model (e.g., deep neural networks, tree ensemble models, etc.) to make accurate predictions given a small number of low-cost features. We show that CoAI dramatically reduces the cost of predicting prehospital acute traumatic coagulopathy, intensive care mortality, and outpatient mortality relative to existing risk scores, while improving prediction accuracy. It also outperforms existing state-of-the-art cost-sensitive prediction approaches in terms of predictive performance, model cost, and training time. Extrapolating these results to all trauma patients in the United States shows that, at a fixed false positive rate, CoAI could alert providers to tens of thousands more dangerous events than other risk scores while reducing providers' data-gathering time by about 90 percent, leading to a savings of 200,000 cumulative hours per year across all providers. We extrapolate similar increases in clinical utility for CoAI in intensive care. These benefits stem from several unique strengths: First, CoAI uses axiomatic feature attribution methods that enable precise estimation of feature importance. Second, CoAI is model-agnostic, allowing users to choose the predictive model that performs the best for the prediction task and data at hand. Finally, unlike many existing methods, CoAI finds high-performance models within a given budget without any tuning of the cost-vs-performance tradeoff. We believe CoAI will dramatically improve patient care in the domains of medicine in which predictions need to be made with limited time and resources.

medRxiv preprint doi: https://doi.org/10.1101/2021.01.19.21249356; this version posted January 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Washington State spent a median of just 16 minutes on the scene of trauma incidents [7]. The median time from EMS dispatch to arrival at the hospital was 48 minutes, just under the initial "golden hour" within which treatment affords the best chance of preventing death. These and other time-limited, resource-intensive healthcare situations leave little time for deploying feature-rich AI predictions. A useful alternative AI approach would be to account for real-world limitations by jointly optimizing for data-gathering cost (e.g., time, effort, or money) as well as accuracy. Such a model could learn on massive datasets with many features and select the optimal subset for prediction within any time, effort, or monetary budget. Most importantly, it would preserve the high accuracy of AI models while turning the heuristic process of feature selection into a principled optimization problem that the model can automatically solve.

To train prediction models with a principled joint optimization, we present a new AI framework, named CoAI (Cost-

affecting up to 30% of severely-injured trauma patients [12]. ATC is associated with acute kidney and lung injury, increased transfusion needs, multiple organ failure and an 8-fold increased risk of early death [13,14]. Trauma patients with ATC require complex care and rapid mobilization of hospital resources including massive blood transfusion protocols and surgical teams [15]. ATC diagnosis is currently based on coagulation testing in the hospital, which delays this time-critical diagnosis and complex healthcare response [16]. Therefore, our goal is to identify trauma patients at high risk of ATC as early as possible before arrival at the hospital to enable faster hospital-based life-saving interventions.
Triage is only one of many tasks EMS providers must perform in trauma responses, so we aim to minimize the time required to gather input features. We selected prehospital features from the Harborview trauma registry, identified the

and predictive performance with existing tools that predict ATC from prehospital data [17,18]. One such model, the Prediction of Acute Coagulopathy in Trauma (PACT) score, required up to an estimated tenfold more data-gathering time than EMS providers reported being willing to spend during trauma responses, indicating the need for cost-aware modeling [17].

We also examine the problem of in-hospital mortality prediction in critical care patients in our "ICU dataset" -

across patients and hospitals even after adjusting for baseline patient characteristics [19]. For this task, we define cost simply as the number of features in the model. This is motivated by the fact that the mortality risk scores with the highest accuracy, such as the Acute Physiology and Chronic Health Evaluation (APACHE) model, are considered impractical for clinical use because they require a large number of features [20]. Models designed to be more efficient for clinical use, such as the Sequential Organ Failure Assessment (SOFA) and quick SOFA (qSOFA) scores, use as few as 3 features but suffer reduced performance as a result [21]. Our goal is to provide a single method that optimally trades off cost with performance and can provide accurate predictions using any number of features deemed clinically feasible.

Finally, we examine 10-year mortality prediction in an "outpatient dataset" from the long-running National Health

(1) mortality prediction is a ubiquitous clinical task; a national survey found that internal medicine physicians were asked to predict patient lifespan roughly once per month but felt ill-prepared to do so [23], and (2) a model that is applicable to all 480 million annual primary care visits in the United States [24] is an important case in which to lower the financial cost of risk scores. While long-term mortality scores have been developed for many specific diseases and patient subpopulations [25, 26, 27], we are not aware of a commonly-used outpatient mortality risk score applicable to the general primary-care population. We hypothesize this is due in part to the prohibitive expense of gathering data for routine mortality prediction, and hope to show that accurate, low-cost predictions can be made in this setting.

Across all three tasks, CoAI consistently improved predictive performance and lowered cost relative to both existing clinical risk scores and existing AI-based methods. CoAI bridges the gap between AI-based predictive models and the real-world constraints of clinical practice by ensuring that predictive models do not impose undue burden on their users. We believe this work will improve the accuracy of clinical risk predictions while ensuring that such predictions are made efficiently enough to have a real-world impact on patient care.

Because of the wide array of prediction tasks and modeling strategies used in developing clinical risk scores, we developed the CoAI (Cost-aware AI) framework, which can be applied to any predictive model (called the base model) to make it cost-aware (Figure 1; Method 4). CoAI takes as input a training data set, consisting of patient features $X$ and prediction labels $y$ across patients, costs $c_i$ for measuring each feature $i$, and a budget $k$ representing total acceptable cost for a predictive model. The goal of CoAI is to select a specific feature set $S$, with total cost no greater than $k$, that yields the best predictive performance given the budget.

This task is computationally challenging because, in general, the exact predictive value of a feature set is unknown without trying to train the model with that specific set of features. Previous approaches which are based on reinforcement learning (RL) attempt to directly search the exponentially large space of all possible feature sets, while others, such as decision tree-based approaches, simplify the problem via greedy search [9, 10]. The idea of CoAI is to find a feasible solution without enumerating all possible feature subsets and with a better selection of features than greedy search approaches. This is enabled by calculating a single quantitative measure of predictive power for each feature, $\phi_i$, and defining the predictive power of a feature set $S$ as $\sum_{i \in S} \phi_i$. We then select the feature set $S$ that maximizes $\sum_{i \in S} \phi_i$ subject to $\sum_{i \in S} c_i \le k$, which is a knapsack problem that can be efficiently solved.

The trauma and ICU datasets were used for benchmarking CoAI against existing clinical risk scores, while the trauma and outpatient datasets were used for benchmarking CoAI against existing AI methods for cost-sensitive prediction. The first three rows for each dataset show the distribution of age, sex, and outcome of interest for each dataset, respectively. The bottom three rows show the distribution of the next three most important features in each dataset, as measured by feature importance (Method 5). Notably, the trauma dataset and ICU dataset have clear age bias (younger patients are more likely to have traumatic injuries and older ones are more likely to end up in the ICU), while the outpatient dataset has a more uniform distribution as it was designed to be representative of American adults.
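The knapsack selection step described above (maximize the summed importance of the chosen features subject to the summed cost staying within budget) can be illustrated with a minimal sketch. It assumes non-negative importances and integer costs (for example, whole seconds) and uses a textbook dynamic program; the function name `select_features` and the example numbers are hypothetical, and the text does not state which knapsack solver CoAI itself uses.

```python
from typing import List, Sequence, Tuple


def select_features(
    importance: Sequence[float], cost: Sequence[int], budget: int
) -> Tuple[List[int], float]:
    """0/1 knapsack by dynamic programming over integer costs.

    Returns the indices of the chosen features and their total importance.
    Assumes importances are non-negative and costs are non-negative integers.
    """
    # best[b] = (total importance, chosen indices) for total cost <= b
    best: List[Tuple[float, List[int]]] = [(0.0, [])] * (budget + 1)
    for i in range(len(importance)):
        new = list(best)  # copy so each feature is used at most once
        for b in range(cost[i], budget + 1):
            candidate = best[b - cost[i]][0] + importance[i]
            if candidate > new[b][0]:
                new[b] = (candidate, best[b - cost[i]][1] + [i])
        best = new
    total, chosen = best[budget]
    return sorted(chosen), total


# Hypothetical example: four features, costs in seconds, 50-second budget.
phi = [0.30, 0.25, 0.20, 0.10]
c = [40, 20, 25, 5]
chosen, total = select_features(phi, c, budget=50)
print(chosen, round(total, 2))
```

In this toy example, a greedy pick of the single most important feature (index 0, 40 seconds) leaves room only for feature 3, whereas the knapsack solution skips it in favor of three cheaper features with higher combined importance.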

CPR, presence of prehospital intubation, injury mechanism, Glasgow Coma Score, and shock index (first prehospital pulse/systolic blood pressure). In our survey of EMS providers, total time cost incurred to obtain all PACT features was 8 minutes (Figure 3d).

We compared ROC plots of PACT to those of CoAI at several clinically important points along the cost-performance tradeoff curve (Figure 3c). For the same time cost as the PACT score (8 minutes), CoAI performs as well as a cost-unconstrained model (0.84 AU-ROC) and exceeds PACT score performance (0.81 AU-ROC). We also determined a realistic time budget using our survey of EMS providers (Figure 3d), who reported being willing to spend 50 seconds using a predictive risk tool on average. This tightly constrained budget is about tenfold less time than the PACT score requires, but the performance of CoAI within this budget (0.82 AU-ROC) still exceeded PACT's performance. Importantly, CoAI's prehospital prediction performance compares favorably to existing post-hospital admission ATC models; a previous study of AI models for ATC achieved AU-ROCs from 0.83 to 0.86 using vital signs, blood gas measurements, and lab values gathered after patients entered the hospital [33]. CoAI attains similar performance using only tightly time-constrained prehospital data.

We also used these results to estimate the impact of deploying CoAI risk scores nationwide (Method 11, Supplementary Figure 4). We expected that CoAI would both increase sensitivity to ATC and reduce the time cost required to make predictions. By extrapolating from total EMS trauma calls in the National Emergency Medical Services Information System database and ATC base rates in our data, we estimate that 120,000 EMS trauma patients in the United

For the ICU dataset (Figure 3b), the APACHE IVa score is known for its high accuracy but is difficult for clinicians to use at the bedside because it requires 27 features to be gathered (Method 8; Method 9). Conversely, the qSOFA score uses only 3 features but is much less accurate. CoAI outperforms qSOFA using only a single feature

2.3 CoAI builds more flexible models that outperform existing cost-sensitive AI methods.

We used the trauma and outpatient datasets to demonstrate how CoAI improves cost and performance over other AI methods. We focus on these two datasets because they have non-uniform feature costs, which provides a more rigorous methodological evaluation. Figure 4 shows cost-versus-performance plots on the (a) trauma and (b) outpatient datasets, with models retrained 100 times on different train/test splits of the data. In these experiments, we train CoAI with both GBMs and logistic regression models as the base predictors to take advantage of its flexibility and model-agnosticism

CWCF to produce continuous outputs and ran fewer replicates to accommodate its slower runtime, but performance was still very low (Method 13).

The results in Figure 4 show several specific benefits of CoAI's design relative to other cost-sensitive learning

Figure 4: CoAI improves prediction performance, model cost, and training complexity over competitor methods. Lines are mean performance over 100 random train/test splits, and shaded bands are 95 percent normal confidence intervals. a) In the trauma dataset, linear models are not as effective as complex ones, but GBM-based CoAI has the highest performance at most budgets. CWCF performance increases with budget but never matches the other methods. b) In the outpatient mortality prediction task, both linear and GBM-based CoAI outperform CEGB and CWCF, and linear CoAI achieves the highest performance. c) Performance of best model found versus number of models trained for CoAI, CEGB, and CWCF. CoAI achieves the highest performance with only two model trainings.

Here, we demonstrate that CoAI also achieves better training complexity than other models. All cost-sensitive models require a tuning parameter to control the tradeoff between accuracy and cost. For CoAI, this parameter is the budget itself. Given a fixed target budget, CoAI always yields the optimal model within that budget in a constant number of training rounds. For CEGB, however, the tuning parameter is unitless (representing the ratio between model cost and loss in the optimization objective). Given a fixed target budget, this requires blind tuning until a good enough model is found that fits within the budget. CWCF supports both budget-based and unitless tuning parameters.

We tested the training complexity of all three models on a single train-test split of the trauma dataset, where we attempted to maximize performance at a single budget (the EMS provider-preferred time cost of 50 seconds) with each model (Figure 4c; Method 16). Blind tuning on CEGB with binary search requires training over 5 times more models than CoAI to reach a similar level of performance (13 total models). CWCF with unitless tuning takes a large number of trainings to yield even a small increase in performance and never reaches the same level of performance as CoAI or CEGB (128 total models). CWCF with a cost-based tuning parameter requires only a single model training, but yields very low performance. Only CoAI is able to offer high predictive performance with a low training complexity.

increased. For each clinical risk score, the top features are listed in order of importance; for APACHE, importance is measured by loss reduction (Method 9) and for PACT, importance is measured by standardized regression coefficient. The qSOFA score weights all variables equally.

For the ICU data, CoAI and existing clinical models rely on different subsets of features (Figures 5a and 5b). Although CoAI and APACHE both rely on admission diagnosis and age, CoAI ranks ventilation, FiO2, heart rate,

although CoAI also relies on many features not chosen by qSOFA. Notably, the higher-ranked CoAI features tend to be baseline information (age, diagnosis, and ventilation status) rather than specific vital signs. A similar situation arises with the PACT score in the trauma dataset; many PACT features are also important in the CoAI model, but CoAI prioritizes inexpensive data available at the time of dispatch before relying on the many vital signs used in PACT (Figure 5c). In particular, the intubation and CPR procedures are ranked more highly by CoAI than by PACT.

As AI and ML models become increasingly prevalent in healthcare, they risk imposing a large data-gathering burden on health care providers unless steps are taken to ensure such models can automatically select highly informative, easily acquired features. Our study is the first, to our knowledge, to survey clinical experts, build improved cost-aware clinical risk scores, and evaluate cost-aware models at operating points chosen by clinical providers.

Our framework, CoAI, is simple, flexible, and can efficiently adapt any predictive model to make cost-sensitive predictions. In the trauma dataset, CoAI's sophisticated AI models and its automated choice of low-cost, high-information features produce predictions more accurate than existing clinical scores at less than one-tenth the data-

CoAI has several desirable properties that make it better suited to cost-aware clinical risk scoring than other AI methods. Its model-agnostic nature enables accurate, low-cost predictions in a wide variety of settings by using complex models for large or nonlinear datasets and simple models for small datasets or ones where linear relationships are expected. Its ability to handle grouped features also makes it a natural fit for data with feature transforms or one-hot encoding and for data where properties of the data acquisition process, such as lab tests that return multiple measurements, result in naturally grouped features.

CoAI demonstrates several other benefits relative to existing methods. Most existing tools do not make hard guarantees or provide worst-case bounds on feature acquisition cost; decision tree and RL methods may ask for features indefinitely so long as doing so improves prediction accuracy enough. CoAI imposes a hard time cutoff in its optimization so that users can be sure it will not recommend excessively costly features. This hard cutoff also serves as the cost-versus-performance tuning parameter, helping users to quickly acquire the best possible model for their desired budget without blindly tuning tens or hundreds of models as other methods do. While CWCF is also capable of using a hard

Overall, we believe CoAI has demonstrated the potential to significantly improve clinical risk prediction. Its design treats the ease of gathering features just as importantly as the accuracy of predictions made using them. Our software is easy-to-use and integrated with existing open-source frameworks. We believe CoAI will make clinical risk scores more cost-sensitive, more accurate, and more effective at saving lives.

Trauma data. The trauma data used in this study were gathered over a 10-year period (2007 to 2017) and encompassed over 14,463 emergency department admissions for traumatic injury at a Level 1 Trauma Center. We selected 45 variables that were available in the pre-hospital setting, including dispatch information (injury date, time, cause, and location), demographic information (age, sex), and prehospital vital signs (blood pressure, heart rate, respiratory rate).

The outcome in this data was acute traumatic coagulopathy (ATC). We followed [17] and defined ATC as a

We assigned costs to features in the outpatient data by referencing Medicare data on payments for lab tests [41]. Physical exam and other free measurements were assigned a cost of zero. Unavailable costs were mean-imputed. The

For interpretability purposes, we add small pseudocosts to zero-cost features (small enough that the sum of all such pseudocosts is less than the difference between any two non-zero-cost features). This is not strictly necessary, but does allow ranking of zero-cost features if desired.
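As a rough illustration of the pseudocost idea, the sketch below assigns every zero-cost feature the same small cost, chosen so that the pseudocosts sum to half of the smallest gap between distinct non-zero costs. The function name `add_pseudocosts`, the equal split, the factor of one half, and the extra bound by the smallest non-zero cost are all assumptions made for this example; the text specifies only the constraint on the sum.

```python
from typing import List, Sequence


def add_pseudocosts(costs: Sequence[float]) -> List[float]:
    """Replace zero costs with a small pseudocost.

    The pseudocosts sum to half of the smallest gap between distinct
    non-zero costs (also bounded by the smallest non-zero cost itself),
    so they cannot change which non-zero-cost features fit in a budget.
    """
    nonzero = sorted({c for c in costs if c > 0})
    n_zero = sum(1 for c in costs if c == 0)
    if n_zero == 0:
        return list(costs)
    gaps = [b - a for a, b in zip(nonzero, nonzero[1:])]
    # Fall back to 1.0 when there are no non-zero costs at all (assumption).
    bound = min(gaps + nonzero[:1]) if nonzero else 1.0
    eps = bound / (2 * n_zero)
    return [c if c > 0 else eps for c in costs]


# Hypothetical costs: two free exam features and three paid lab tests.
print(add_pseudocosts([0.0, 0.0, 3.0, 5.0, 9.0]))
```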

We needed measures of feature importance as well as feature cost to perform our CoAI analysis. We calculated feature

SHAP estimates of feature importance can be calculated for any AI/ML model. Because we use GBM and linear models, we make use of fast algorithms for calculating SHAP values in decision trees and linear models [28,29]. This improves runtime and reduces variability in importance estimates. It is also worth noting that CoAI is compatible with any feature attribution method that assigns a measure of importance $\phi_i$ to each feature. SHAP is not the only method of this type, but it is model-agnostic, satisfies desirable axioms, and has fast specialized implementations for the models we consider here.
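To make the per-feature importance concrete, here is a minimal sketch for the linear case. For a linear model with independent features, the SHAP value of feature $i$ for an input $x$ is $w_i (x_i - \mathbb{E}[x_i])$. The sketch computes these values in plain Python and then summarizes each feature by its mean absolute SHAP value; that summary is a common convention and an assumption here, since this text does not spell out how the per-sample values are aggregated. The function names and the toy data are hypothetical.

```python
from typing import List, Sequence


def linear_shap(weights: Sequence[float], X: Sequence[Sequence[float]]) -> List[List[float]]:
    """Per-sample SHAP values for a linear model, assuming independent features."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(weights))]
    return [[w * (x - m) for w, x, m in zip(weights, row, means)] for row in X]


def global_importance(shap_values: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute SHAP value per feature (an assumed aggregation)."""
    n = len(shap_values)
    return [sum(abs(row[j]) for row in shap_values) / n for j in range(len(shap_values[0]))]


# Toy data: three samples, three features; the third weight is zero.
w = [2.0, -1.0, 0.0]
X = [[1.0, 0.0, 5.0], [3.0, 2.0, 5.0], [5.0, 4.0, 8.0]]
phi = global_importance(linear_shap(w, X))
print(phi)
```

The resulting list plays the role of the $\phi_i$ values that feed the budgeted feature selection.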

We implemented three additional methods to search for optimal feature sets with respect to feature importance and

Several scores have been developed for the specific case of ATC in trauma patients. The COAST score, an additive point-based score using abdominal/pelvic injury, chest decompression, temperature, systolic blood pressure, and entrapment, was one of the earliest [18]. The subsequent PACT score was a six-feature logistic regression involving shock index, age, mechanism of injury, Glasgow Coma Score, and prehospital CPR and intubation [17]. Both models use relatively simple prediction methods and a fixed set of features that limit the range of time budgets in which they can be used. A recent study developed a linear model to predict whether military trauma patients would receive massive transfusion of blood products, with the goal of discovering "concrete and rapidly and easily assessable" predictors. This study did not explicitly account for model cost, but noted that some data, including vital signs, may be difficult to acquire or unavailable in the prehospital setting and developed multiple models with different numbers of features, implying the potential value in this area of models that automatically account for cost [38].

Many risk scores exist to predict mortality of ICU patients. The most popular include the APACHE, APS, SOFA, and qSOFA models [20,21]. Most of these models take as input a large number of features, while the qSOFA score uses only the Glasgow Coma Score, respiratory rate, and blood pressure at the cost of worse predictive performance. These risk scores all use linear or additive models that aim to either achieve high accuracy with many features, or moderate accuracy with few features. Although mortality prediction in critically ill patients is a topic of great interest in medicine, only a small number of feature sets have been explored. There is no single published model that can make accurate predictions within any feature budget. Finally, although outpatient survival prediction is an important task, we are not aware of a standard clinical tool for this purpose.

Method 9 Implementation of Existing Clinical Models

We compared to the following clinical models: qSOFA, APS, APACHE IIIa, and APACHE IVa in the ICU dataset, and the PACT score in the trauma dataset. APS, APACHE IIIa, and APACHE IVa were pre-calculated for the ICU dataset. We re-implemented the qSOFA and PACT scores by referring to their respective publications [21,17]. Notably, the qSOFA score required systolic rather than mean blood pressure as an input variable, so we extracted systolic blood pressure data from the eICU dataset and only gave qSOFA access to this variable. We handled missing data in qSOFA by assuming the corresponding binary variable was false (i.e., the input "respiratory rate greater than 22" was always false if respiratory rate was missing). We handled missing data in PACT with mean imputation. We also found that re-training logistic regression models on the variables in the PACT model substantially improved performance. The final plots show results from the re-trained PACT regression.
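The missing-data rule for qSOFA described above can be sketched as follows. The thresholds used here (respiratory rate of at least 22 breaths per minute, systolic blood pressure of at most 100 mmHg, Glasgow Coma Score below 15) are the commonly cited qSOFA criteria and are assumptions in this sketch rather than values confirmed by this text; the function name `qsofa` is hypothetical.

```python
from typing import Optional


def qsofa(
    resp_rate: Optional[float],
    systolic_bp: Optional[float],
    gcs: Optional[float],
) -> int:
    """qSOFA score (0 to 3) with equal weight per criterion.

    A missing input (None) makes its binary criterion False, mirroring
    the missing-data handling described in the text.
    """
    rr_flag = resp_rate is not None and resp_rate >= 22      # assumed threshold
    sbp_flag = systolic_bp is not None and systolic_bp <= 100  # assumed threshold
    gcs_flag = gcs is not None and gcs < 15                  # assumed threshold
    return int(rr_flag) + int(sbp_flag) + int(gcs_flag)


# A patient with no recorded respiratory rate scores only on the other inputs.
print(qsofa(None, 95, 15))
```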

In Figure 5, we ranked features in each clinical risk score by their importance. We ranked PACT features by standardized regression coefficient. The qSOFA score assigned equal weight to all features, so the ordering was arbitrary. The APACHE IVa score did not publish standardized regression coefficients, but did publish the reduction in loss from adding each group of features to the model. We used this data to rank features by importance; for each group of features other than the APS score, we divided the group's loss reduction evenly among the group members to obtain estimates of each feature's loss reduction. For the APS score features, we divided the total loss reduction for APS features proportional to each feature's maximum possible univariate contribution to the APS score. Some binary features modulate other features' univariate contributions; for these features, we assign importance proportional to the difference between the maximum univariate score with the feature on and maximum univariate score with the feature off. Features for which this would result in an importance of 0 are assigned an importance of 1. We divided credit for APS components that depended on multiple features, such as GCS and A-a gradient, evenly among the contributing features. A-a gradient is unique in that it is calculated from PaO2 and several other features but cannot have a nonzero contribution to APS when PaO2 itself has a nonzero univariate contribution; thus, we do not allow it to contribute to PaO2's importance.
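The two allocation rules described above (an even split within ordinary feature groups, and a split proportional to maximum univariate contribution for APS features) can be sketched as below. This is only an illustration: the function names, group names, feature names, and numbers are hypothetical, and the special cases for modulating binary features and the A-a gradient are not modeled.

```python
from typing import Dict, List


def split_evenly(group_loss: Dict[str, float], groups: Dict[str, List[str]]) -> Dict[str, float]:
    """Divide each group's loss reduction evenly among its member features."""
    importance: Dict[str, float] = {}
    for name, members in groups.items():
        share = group_loss[name] / len(members)
        for feature in members:
            importance[feature] = share
    return importance


def split_proportionally(total: float, max_contribution: Dict[str, float]) -> Dict[str, float]:
    """Divide a total loss reduction in proportion to each feature's maximum score."""
    denom = sum(max_contribution.values())
    return {f: total * v / denom for f, v in max_contribution.items()}


# Hypothetical numbers for illustration only.
print(split_evenly({"group_a": 6.0}, {"group_a": ["f1", "f2", "f3"]}))
print(split_proportionally(10.0, {"vital_a": 3.0, "vital_b": 1.0}))
```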

Method 10 Other Cost-sensitive prediction methods

Cost-sensitive prediction is a topic of growing interest in ML and AI. Established techniques, like the LASSO penalty in regression, encourage models to rely on few of their input features but do not generally incorporate the idea that different features may have different costs [46]. More recent methods have attempted to minimize the feature acquisition cost for each individual prediction while maximizing its accuracy. Methods involve either perturbing an existing model to determine the most important features [47], using decision trees to divide the data into similar groups while penalizing splits that use expensive features [48,49,9], or applying reinforcement learning (RL) approaches, which use deep learning to simulate the process of asking for features one at a time and then making a prediction [10, 50, 51, 52]. In this paper, we estimate feature importance using state-of-the-art axiomatic methods that guarantee features with a greater effect on the output will be ranked more highly. This can be seen as an improvement on perturbation-based methods and allows CoAI to accurately choose the most important features within a given cost budget [28,29].

Despite the emergence of methods for low-cost AI, scant research has used these methods to produce risk scores for real clinical problems. Many approaches are evaluated on toy datasets with random or arbitrary feature costs. We know of only one paper that evaluated cost-sensitive methods on a clinical prediction task, which used a Mechanical Turk survey to gather costs from laypeople rather than attempting to synthesize expert opinion [52]. While this study was a valuable attempt to reduce the burden of diagnosis for patients, because costs were measured on a 1-10 subjective