ABSTRACT
Objective Electronic health records (EHRs) can improve patient care by enabling systematic identification of patients for targeted decision support. However, this requires scalable learning of computable phenotypes. To this end, we developed the feature engineering automation tool (FEAT) and assessed it in targeting screening for the under-diagnosed, under-treated disease primary aldosteronism.
Materials and Methods We selected 1199 subjects receiving longitudinal care in one health system between 2007 and 2017 and classified them for hypertension (N=608), hypertension with unexplained hypokalemia (N=172), and apparent treatment-resistant hypertension (N=176) by chart review. We derived 331 features from EHR encounters, diagnoses, laboratories, medications, vitals, and notes. We modified FEAT to encourage model parsimony and compared its models’ performance and interpretability to that of expert-curated heuristics and conventional machine learning.
Results FEAT models trained to replicate expert-curated heuristics had higher AUPRC scores than all other models (p < 0.001) except random forests and were smaller than all other models (p < 1e-6) except decision trees. FEAT models trained to predict chart review phenotypes exhibited similar AUPRC scores to penalized logistic regression while being substantially simpler than all other models (p < 1e-6). For treatment-resistant hypertension, FEAT learned a six-feature, clinically intuitive model that demonstrated an adjusted PPV of 0.73 and sensitivity of 0.54 in testing.
Discussion FEAT learns computable phenotypes that approach the performance of expert-curated heuristics and conventional machine learning without sacrificing interpretability.
Conclusion By constructing accurate and interpretable computable phenotypes at scale, FEAT has the potential to facilitate widespread, systematic clinical decision support.
INTRODUCTION
The adoption of electronic health records (EHRs) is transforming medicine by aiding clinical decision making and facilitating translational research.1,2 In order to leverage EHR data, researchers must often define rules or algorithms known as computable phenotypes that transparently identify patient cohorts with certain characteristics or phenotypes of interest.3–5 While there have been significant advances in creating and standardizing computable phenotypes for various conditions, developing accurate computable phenotypes remains a time-consuming and challenging process due to the heterogeneity, imprecision, and high dimensionality of EHR data.1,2,6–8
Various rule-based and machine learning (ML) approaches have been developed for generating computable phenotypes.7 Due to the challenges of learning from the messy, high-dimensional, mixed-type data that constitute EHRs, many recent studies have focused on training large, complex models using ensemble methods or deep learning.9–13 Many algorithms employed in these studies, e.g. random forests and neural networks, can perform very well in classification but often lack interpretability, a subjective concept that can be thought of as the extent to which a model can be understood and/or its behavior interpreted by a user.14–18 Many have noted interpretability as a key concern in EHR-based ML models19, particularly in biomedical applications.20
Despite the subjectivity of interpretability, there are several reasons why interpretable phenotypes are preferable to black-box ML models.21,22 Concise models are easier to understand and apply to existing decision-making frameworks, thus allowing clinicians to corroborate predictions. When a model’s decision-making process is understood, clinicians can verify or second-guess predictions, which should lead to trust and an overall higher quality of clinical decision making. As the FDA’s proposed regulatory framework for the evaluation of automated clinical decision support systems requires clinicians to “independently review the basis for such recommendations,” interpretability will be an important factor in determining the regulatory requirements for future ML deployments.23 In addition, simpler, transparent models may be more easily adjusted as clinical practices change or models are applied in new practice settings.
In this paper, we have improved and then applied the feature engineering automation tool (FEAT) to generate computable phenotypes that are both accurate and interpretable.24–26 FEAT constructs fully interpretable feature representations, encoded as networks, in tandem with fitting a classification model. The representations are evolved using a population-based Pareto optimization algorithm that jointly optimizes model discrimination and complexity.27,28 This approach to model training significantly reduces the feature space while achieving full model simulatability, i.e. the ability of a human user to follow the full decision process of the model.
We have applied FEAT to learn a computable phenotype for identifying patients that should be screened for primary aldosteronism (PA), the most frequent cause of secondary hypertension.29 Epidemiological studies suggest that PA affects ∼1% of US adults, but recent literature demonstrates it is under-screened and under-diagnosed.30–34 Thus, identifying patients who should be screened for PA could improve their care. Using FEAT, we have developed computable phenotypes to identify patients for whom clinical specialty guidelines recommend PA screening30. We observe that phenotypes constructed using FEAT are on average significantly less complex than those learned by conventional ML methods but achieve similar discriminative performance.
In the following section, we discuss related work on interpretability in more detail to motivate our interest in developing and applying FEAT to learning computable phenotypes. We then describe our data collection, the FEAT method, and the experimental design. Finally, we present performance comparisons and interpret the models produced by FEAT and other ML methods for the three phenotypes of interest, ending with a discussion of the implications and future work that follows from our analysis.
BACKGROUND AND SIGNIFICANCE
There are two overarching approaches to interpretable modeling. The first is to apply a post-hoc analysis tool to a black box model that determines empirically which factors are relevant to the model’s predictions. Examples of post-hoc methods include permutation importance35,36, LIME37, and SHAP38. SHAP values in particular can be very useful for describing how a black-box model behaves under specific input conditions39. However, these approaches do not describe the mechanism by which factors result in the predictions. Furthermore, since these importance scores do not describe the behavior of the model over all input conditions, it is challenging to predict model behavior as inputs change.
The second approach to interpretable modeling is to focus on learning concise models that are self-explanatory. As Lundberg et al. put it, “the best explanation of a simple model is the model itself.”38 The most commonly used method in this category is logistic regression, often employed with regularization approaches, such as the lasso and ridge regression.23,24 Decision trees and Bayesian rule lists can generate interpretable models when constrained to small tree depths and low rule count, respectively.20,25 Yet these approaches are limited in that smaller models may not represent complex data trends, and larger models are often uninterpretable.17,26 In regularized regression and pruned decision trees, the trade-off between simplicity and explanatory power is left to be tuned by the user. More sophisticated strategies can characterize the trade-off space between model complexity and model accuracy, such as Pareto optimization with symbolic regression.28 Symbolic regression is a method of learning the functional form and parameters of a model using a randomized, heuristic search process such as evolutionary computation.40 Pareto optimization refers to a multi-objective optimization process in which preference relations between models are determined by their closeness to the “Pareto front”, which is a set of points that represent the best observed trade-offs between objectives.
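The notion of a Pareto front described above can be illustrated with a minimal sketch. It assumes two objectives, a discrimination score to maximize and a model size to minimize; the candidate names and values are hypothetical and are not results from this study.

```python
from typing import List, Tuple

# (name, discrimination score to maximize, model size to minimize)
Model = Tuple[str, float, int]


def dominates(a: Model, b: Model) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    _, score_a, size_a = a
    _, score_b, size_b = b
    no_worse = score_a >= score_b and size_a <= size_b
    strictly_better = score_a > score_b or size_a < size_b
    return no_worse and strictly_better


def pareto_front(models: List[Model]) -> List[Model]:
    """Keep every model that no other model dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


# Hypothetical candidates for illustration only.
candidates = [
    ("small", 0.66, 8),
    ("medium", 0.69, 10),
    ("bloated", 0.68, 40),  # dominated by "medium": lower score, larger size
    ("large", 0.75, 5000),
]
front = pareto_front(candidates)
```

Here the "bloated" candidate is dropped because another model is both smaller and more discriminative, while the remaining three represent different trade-offs.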
Symbolic regression with Pareto optimization has been used to develop simple models in other domains, such as physics41, biology42, fluid dynamics43, and wind energy.44 To our knowledge, this is the first work to explore the application of symbolic regression with Pareto optimization to EHR phenotyping.
MATERIALS AND METHODS
In the first part of this section, we describe the development of the cohort used to generate the computable phenotypes. In the second part, we describe the FEAT algorithm. We introduce the methodological changes made with the intention of promoting conciseness in the resulting models. In the final part, we describe the empirical studies that validated our methodological updates to FEAT as well as our study of computable phenotypes targeting patients that meet guideline-based criteria for being screened for secondary hypertension.
Benchmark Data
To benchmark changes to FEAT, we applied variant methods to 20 classification tasks in the Penn Machine Learning Benchmark (PMLB)45, described in the Supplemental Material.
Patients
We studied 1200 patients receiving longitudinal primary care in the University of Pennsylvania Healthcare System (UPHS). Patients included had at least five outpatient visits in at least three separate years between 2007 and 2017, at least two encounters at one of 40 primary care practice sites, and were 18 years or older in 2018. A set of 1000 random patients from this cohort were divided into 800 for model training and 200 for model testing. One subject in the random training set was excluded because of a mid-study change in enterprise master patient index (EMPI) identifier.
For each subject, a study physician (I.A.) reviewed clinical charts and classified patients with three phenotypes of increasing complexity for hypertension related to screening guidelines for PA: (A) hypertension, (B) hypertension with unexplained hypokalemia (HTN-hk), and (C) apparent treatment-resistant hypertension (aTRH). Classification was based on JNC7 Guidelines on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure46. Unclear cases were further reviewed by an additional study physician (D.S.H. or J.C.). See Supplemental Material for further details.
Preliminary and final expert-curated heuristics for aTRH and HTN-hk (see below) were used to identify an additional 50 subjects per phenotype for model training and model testing, respectively (100 targeted subjects in each set). This yielded a total of 899 subjects in the training set and 300 subjects in the testing set. This study protocol was reviewed and approved by the University of Pennsylvania IRB (#827260).
Clinical Data
We extracted 331 features from the EHR clinical data repository Penn Data Store and the EPIC Clarity reporting database. Demographic and encounter features included age, race, sex, categorized distance from zip code 19104, weight, BMI, blood pressures, and number of elevated blood pressures. Longitudinal features were aggregated as min, max, median, standard deviation, and skewness. The 34 most common laboratory test results (complete metabolic panel, complete blood count with differential, lipids, TSH, and hemoglobin A1c) with < 33% missingness were summarized as min, max, median, 1st quartile, and 3rd quartile. Diagnosis codes for hypertension, associated comorbidities, and indications for anti-hypertensives were aggregated and summarized as median per year and sum. Medication prescriptions were summarized as the number of days prescribed for each antihypertensive class; counts of encounters while prescribed 1, 2, 3, or 4 or more medications were summarized as sum, median, standard deviation, and skewness, as well as the sum of elevated blood pressures at those encounters. Regular expressions were applied to clinical notes to identify mentions of ‘hypertension’ and variants thereof, summarized as counts. Features with values outside of physiologically reasonable ranges, less than 5% non-zero counts, or variance less than 0.05 were excluded. Missing values were median-imputed. The full data dictionary is available as Supplemental Data.
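As a rough illustration of the longitudinal summaries and median imputation described above, the following sketch uses only the Python standard library. The function names are hypothetical, and the use of population (rather than sample) standard deviation and skewness is an assumption, since the text does not specify the estimator.

```python
import statistics
from typing import Dict, List, Optional


def skewness(values: List[float]) -> float:
    """Population skewness (mean cubed z-score); 0.0 when there is no spread."""
    n = len(values)
    mean = sum(values) / n
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in values) / n


def summarize(values: List[float]) -> Dict[str, float]:
    """Collapse one patient's repeated measurements into summary features."""
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "sd": statistics.pstdev(values),  # assumption: population SD
        "skew": skewness(values),
    }


def median_impute(column: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in column if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in column]


# Hypothetical systolic blood pressures for one patient.
sbp_summary = summarize([120.0, 130.0, 140.0])
imputed = median_impute([1.0, None, 3.0])
```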
Construction of expert-curated heuristics
Next, computable phenotypes (heuristics) were manually curated for the three target phenotypes by expert review of EHR data and several iterations of proposing, applying, and evaluating the heuristics. Heuristics were initially developed from a set of random training patients. Preliminary heuristics for HTN-hk and aTRH were each used to identify 50 additional patients, and the heuristics were then iteratively evaluated and updated. Thus, final heuristics were developed from the entire set of 799 random and 100 targeted training patients. They were then used to flag an additional 100 patients for the held-out testing set.
The heuristic designed for hypertension queried for a history of two or more diagnosis codes for hypertension (ICD-9: 401.*, 405.*; ICD-10: I10.*, I15.*). For HTN-hk, we labeled patients with at least two diagnosis codes for hypokalemia (ICD-9: 276.8; ICD-10: E87.6); or at least two outpatient encounters with low potassium measurement (< 3.6 mmol/L); or at least two prescriptions for an oral potassium supplement. For aTRH, we labeled patients (1) with documentation of at least 2 out of 5 consecutive outpatient encounters with elevated blood pressure (systolic blood pressure >= 140 mmHg or diastolic blood pressure >= 90 mmHg) while on antihypertensive medications from 3 distinct classes for at least 30 days prior to the elevated blood pressures or (2) prescribed four or more antihypertensive drug classes for at least 30 days. Exclusion criteria for aTRH included patients with diagnosis codes for heart failure or transplant (ICD-9: 428.*, V42.1; ICD-10: I50.*, Z94.1) or moderate to severe chronic kidney disease (estimated Glomerular Filtration Rate (MDRD) < 45 mL/min/1.73 m²) prior to meeting the above criteria.
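To make the structure of such a heuristic concrete, the sketch below encodes only the three hypokalemia criteria listed above for HTN-hk. The `PatientRecord` container and its field names are hypothetical, and any additional conditions applied in the study (for example, the hypertension requirement itself) are not represented.

```python
from dataclasses import dataclass, field
from typing import List

HYPOKALEMIA_CODES = {"276.8", "E87.6"}  # ICD-9 and ICD-10 codes from the text
LOW_POTASSIUM_MMOL_L = 3.6


@dataclass
class PatientRecord:
    """Hypothetical per-patient container; not the study's data model."""
    diagnosis_codes: List[str] = field(default_factory=list)
    # One potassium value (mmol/L) per outpatient encounter with a measurement.
    outpatient_potassium: List[float] = field(default_factory=list)
    oral_potassium_prescriptions: int = 0


def meets_hypokalemia_criteria(p: PatientRecord) -> bool:
    """True if any one of the three criteria is met at least twice."""
    n_codes = sum(code in HYPOKALEMIA_CODES for code in p.diagnosis_codes)
    n_low = sum(k < LOW_POTASSIUM_MMOL_L for k in p.outpatient_potassium)
    return (
        n_codes >= 2
        or n_low >= 2
        or p.oral_potassium_prescriptions >= 2
    )


flagged = meets_hypokalemia_criteria(
    PatientRecord(outpatient_potassium=[3.4, 4.1, 3.5])
)
```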
Feature Engineering Automation Tool (FEAT)
We adapted a recent method for learning informative feature representations called FEAT for automated clinical phenotyping (https://lacava.github.io/feat).24–26 For this task, we are interested in learning a classification model from a set of N paired samples, {(y_i, x_i), i = 1, …, N}, with binary labels y ∈ {0,1} and attributes x ∈ ℝ^d. FEAT attempts to learn a set of features for a logistic regression model of the form
$$\hat{p}(y = 1 \mid x) = \frac{1}{1 + \exp\left(-\phi(x)^{T}\beta\right)},$$
where ϕ(x) is a p-dimensional vector of transformations of x learned from FEAT’s optimization process. The coefficients β = [β1, …, βp] are associated with each of these transformed features. An overview of the FEAT algorithm is given in Fig. 1.
How FEAT works. (A) Steps in the genetic programming process. Candidate models are initialized in a population; the best models (parents) are selected via epsilon-lexicase selection; offspring are created by applying variation operations to the parents; and then parents and offspring compete in a survival step using NSGA-II [22]. The process then repeats. (B) Evaluation of a candidate model’s complexity and performance in the Pareto optimization framework during the survival step. (C) Example model in which input features are transformed by boolean functions with or without threshold operators.
For the purposes of this study, we restricted the transformation operators to boolean functions: <, >, AND, OR, NOT. This means that FEAT only searches the space of representations consisting of these operators and the input features. We include operators that use Gini impurity to choose the split threshold for each feature in an equivalent way to classification and regression trees. Note that because the optimization process includes mutation to or insertion of new input features, it allows for non-greedy search to occur to find the best fit for the problem at hand, in contrast to decision trees.
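The Gini-based choice of a split threshold mentioned above can be sketched as follows. This is a generic CART-style illustration that considers midpoints between consecutive distinct values as candidates; FEAT's own implementation may differ in its details, and the example data are hypothetical.

```python
from typing import List, Optional


def gini(labels: List[int]) -> float:
    """Gini impurity of a set of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_threshold(values: List[float], labels: List[int]) -> Optional[float]:
    """Return the threshold t minimizing weighted impurity of (x <= t, x > t)."""
    distinct = sorted(set(values))
    best_score, best_t = float("inf"), None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0  # candidate midpoint between neighboring values
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        if score < best_score:
            best_score, best_t = score, t
    return best_t


# Hypothetical systolic pressures and binary labels.
threshold = best_threshold(
    [110.0, 120.0, 125.0, 135.0, 150.0, 160.0],
    [0, 0, 0, 1, 1, 1],
)
```

The resulting boolean feature (value > threshold) is the kind of thresholded input that the restricted operator set can then combine with AND, OR, and NOT.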
To encourage model parsimony, we modified FEAT in two distinct ways. First, to handle high-dimensional data, rather than fitting a multivariate linear model to all the data at the start of optimization, we sampled the input data based on univariate logistic regression coefficients. Second, we added a post-run simplification procedure to shrink the final feature representation without significantly altering its behavior. This post-run simplification procedure consists of 1) explicitly removing redundant serial logical operators, 2) adaptively pruning highly correlated components of representations, and 3) applying random deletion mutations to the features in a hill-climbing fashion. For a detailed description of these changes and a benchmark validation of their effectiveness, see the Supplemental Material.
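The third simplification step, random deletion applied in a hill-climbing fashion, can be sketched generically. The acceptance rule used here (keep a deletion when a user-supplied loss does not rise by more than a tolerance) is an assumption, as are the function name, the parameters, and the toy loss; the procedure actually used is described in the Supplemental Material.

```python
import random
from typing import Callable, List, Sequence


def hill_climb_delete(
    features: Sequence[str],
    loss: Callable[[List[str]], float],
    tolerance: float = 0.0,
    n_trials: int = 100,
    seed: int = 0,
) -> List[str]:
    """Try random single-feature deletions; keep those that do not hurt the loss."""
    rng = random.Random(seed)
    current = list(features)
    baseline = loss(current)
    for _ in range(n_trials):
        if len(current) <= 1:
            break
        candidate = list(current)
        del candidate[rng.randrange(len(candidate))]
        if loss(candidate) <= baseline + tolerance:
            current = candidate  # deletion accepted: model is smaller, no worse
    return current


# Toy loss for illustration: one unit of loss per missing informative feature.
INFORMATIVE = {"a", "b"}


def toy_loss(selected: List[str]) -> float:
    return float(len(INFORMATIVE - set(selected)))


simplified = hill_climb_delete(["a", "b", "noise1", "noise2"], toy_loss)
```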
Comparator Methods
To assess how FEAT compares to conventional ML models, we tested five supervised classifiers: LASSO-penalized logistic regression (LR L1), ridge-penalized logistic regression (LR L2), decision tree (DT), random forest (RF), and Gaussian Naïve Bayes (GNB). Hyperparameters for each of the models were optimized using 5-fold nested cross-validation. All of the comparator methods were implemented using Scikit-learn47. We report the mean test area under the precision-recall curve (AUPRC) and area under the receiver operating characteristic curve (AUROC) for all experiments. AUPRC is calculated as average precision (see sklearn.metrics.average_precision_score, scikit-learn version 0.23.2). In addition to comparing discrimination performance, we compare the size of the final models. For the tree-based methods (FEAT, decision tree, and random forest), we define the size of the models to be the total number of nodes in the trees. For the linear methods and GNB, we define the size to be the number of predictors with non-zero coefficients in the final model. Model performance and model sizes were compared using paired Wilcoxon rank-sum tests. Model thresholds were selected in the training set to achieve a positive-predictive value (PPV) in the longitudinal, primary care cohort of 0.70. Study code is available in this repository: https://bitbucket.org/hermanlab/ehr_feat/.
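For readers unfamiliar with average precision, the following standard-library sketch computes the same quantity as the step-wise sum AP = Σ (R_n − R_{n−1}) P_n over score thresholds. It is an illustrative re-implementation, not the scikit-learn code used in the study, and the example labels and scores are hypothetical.

```python
from typing import List


def average_precision(y_true: List[int], scores: List[float]) -> float:
    """Average precision: sum over thresholds of (recall change) * precision."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    ap, tp, prev_recall = 0.0, 0, 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Samples tied at the same score share a single threshold.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            j += 1
        recall = tp / n_pos
        precision = tp / j  # j samples are predicted positive at this threshold
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


ap = average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this small example the two positives are ranked first and third, giving AP = 0.5 × 1 + 0.5 × (2/3) ≈ 0.83.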
RESULTS
Development of automated phenotyping method
To automatically construct computable phenotypes whose outputs are directly interpretable by clinical practitioners, we extended FEAT to better implement boolean logic, added procedures to encourage model parsimony, and developed approaches for improving robustness. To evaluate these modifications, we applied them to a set of benchmark datasets45 that were similar in shape to our EHR dataset. We found that restricting operators and simplifying models did not significantly impair classification performance but did decrease the size of resulting models considerably (Supp. Fig. 1; p = 7.2e-9). Detailed results are available in the Supplemental Material.
Recreating manually curated computable phenotypes
We next applied our optimized FEAT method to learn models to recapitulate the expert-curated heuristics for hypertension, HTN-hk, and aTRH from a training set of 899 subjects. For each heuristic, we ran 50 trials of 5-fold cross-validation on shuffled training datasets and seeds and averaged test scores across folds (Fig. 2, top row; Table 1). Across all three heuristics, FEAT models achieved higher AUPRC scores (p < 0.001; Supplemental Fig. 4) than all other models except RF. FEAT models were significantly smaller than all other models (p < 1×10−6) except decision trees.
Computable phenotype discrimination and size for each target phenotype.
AUPRC scores for phenotyping models trained in 5-fold cross-validation over 50 iterations, each averaged across testing folds. Each subplot represents a different training outcome; heuristics are shown in the top row, and chart-review diagnoses are shown in the bottom row.
Automated learning of computable phenotypes
Next, we compared the performance of models trained to predict the chart-review phenotypes (Fig. 2, bottom; Table 1), which were present in 423 (47%), 93 (10%), and 103 (11%) subjects, respectively. Across all phenotypes, FEAT models achieved AUPRC scores that were higher than GNB, LR L2, and DT models (p < 0.001; Supplemental Fig. 4), comparable to LR L1 models (p > 0.99), and lower than RF models (p < 1e-6). These relationships were relatively consistent across outcomes, except that FEAT models appeared to also outperform LR L1 for HTN-hk. FEAT models were significantly smaller than all other models (p < 1e-6); on average, models were approximately 1800 times smaller than RF models and 2.9 times smaller than LR L1 models. We next explored the trade-off between model performance and complexity for heuristic and chart-review trained models (Fig. 3). The FEAT models clustered near the high-performance, low-complexity region of this tradeoff space (top left), indicating that they learned a relatively efficient trade-off between these two objectives.
Each point shows the cross-validation testing AUPRC (y-axis) and size (x-axis) for models trained in 50 repeat trials for each method. Each subplot represents a different expert-curated heuristic (top row) or chart review phenotype (bottom). The ideal model is discriminative and simple, meaning it is near the top left corner.
Precision-recall curves (left) and receiver-operating curves (right) for phenotyping models trained to predict chart review classifications for aTRH. Values shown are means of test performance in 5-fold cross-validation iterated 50 times.
For the most complex phenotype, aTRH, FEAT models achieved a median AUPRC score of 0.69 (interquartile range [IQR]: 0.05) at a median size of 9.8 (IQR: 1.8). These models showed reasonable discrimination across all potential decision thresholds, as depicted by PRC and ROC curves (Fig. 4). Of note, the expert-curated heuristic demonstrated superior discrimination to all ML models at its single operating point.
Assessment of Model Generalization and Clinical Utility
Next, we applied the methods refined by cross-validation to learn models from the entire training set and assessed their performance on a test set of 300 subjects, including 185 (61%), 79 (26%), and 73 (24%) subjects for each chart-review phenotype. Model performance and size (Table 2) were relatively consistent with cross-validation estimates. Most appeared to have slightly better discrimination, likely due in part to the great enrichment for cases in the testing cohort. For hypertension, HTN-hk, and aTRH, the FEAT models demonstrated AUPRC scores of 0.99, 0.96, and 0.80, and AUROC scores of 0.99, 0.98, and 0.94, respectively. For HTN-hk, these FEAT models improved upon the AUPRC of the expert-curated heuristics by 18%, and on the other two heuristics, FEAT models and expert heuristics performed similarly (within 2% of each other).
Final model performance on test set for each method, first when trained to predict heuristics, and then when trained to predict chart review phenotypes.
To further evaluate the utility of the resulting models, we selected an interpretive threshold from training data expected to yield a PPV of 0.7 in our target population of patients receiving longitudinal, primary care. For aTRH, we assumed a case prevalence of 7.5% based on the frequency observed in our training set and meta-analyses48. This resulted in the selection of a threshold of 0.596, which demonstrated a sensitivity of 0.62 in training. Among the 200 randomly drawn test patients, this FEAT model yielded an adjusted PPV of 0.73 and sensitivity of 0.54. In comparison, the heuristic showed an adjusted PPV of 0.87 and sensitivity of 0.92. Among the 100 test patients flagged by the heuristics, the final FEAT model had a PPV of 0.87, compared to a PPV of 0.83 for the expert heuristic.
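One common way to express a PPV at an assumed target prevalence is Bayes' rule applied to sensitivity and specificity. The sketch below shows that calculation; it is an assumption that the adjustment in this study took this form, and the inputs in the example are hypothetical rather than study results.

```python
def adjusted_ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV implied by sensitivity and specificity at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0:
        raise ValueError("no predicted positives at these operating values")
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point evaluated at two prevalences.
ppv_enriched = adjusted_ppv(sensitivity=0.5, specificity=0.9, prevalence=0.5)
ppv_target = adjusted_ppv(sensitivity=0.5, specificity=0.9, prevalence=0.075)
```

The same sensitivity and specificity give a much lower PPV at 7.5% prevalence than in a case-enriched sample, which is why PPVs observed in enriched cohorts need adjustment before being compared to a target population.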
Model Interpretability
Finally, we evaluated the relative interpretability of the resulting models, focusing on the models for predicting aTRH. The final FEAT model was concise and interpretable (Fig. 5). The FEAT model assigns risk according to the following factors, in order of absolute coefficient magnitudes: first, a history of more than one encounter while prescribed three or more anti-hypertensive medications (β = 1.33); second, a mean systolic blood pressure above 128.6 mmHg (β = 0.95); third, a history of low variability (standard deviation) in the number of encounters while prescribed two anti-hypertensive medications per year (β = -0.52); fourth, a history of a median of 1.25 or more encounters per year while on four or more hypertension medications (β = 0.49); fifth, more than 40 mentions of hypertension in the patient notes (β = 0.42); and sixth, a maximum total calcium greater than 10.1 mg/dL (β = 0.40). To investigate the factors underlying the maximum calcium feature, we explored its associations. We found that subjects with aTRH were in fact more likely to have elevated maximum calcium (OR=4.4, p=4×10−9) and that these elevations were in turn associated with days prescribed thiazide diuretics (OR=1.5 per SD, p=3×10−6) and beta-blockers (OR=1.4, p=2×10−4).
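The six-factor model described above can be written as a short scoring function. The coefficients and thresholds below are those reported in the text; the intercept and the scaling of the one continuous feature are not reported here, so `intercept` and the raw use of `sd_enc_2_meds` are placeholders. The outputs are therefore illustrative only and should not be read as the study's predicted probabilities.

```python
import math


def atrh_linear_score(
    n_enc_3plus_meds: float,
    mean_sbp: float,
    sd_enc_2_meds: float,
    median_enc_4plus_meds_per_year: float,
    n_htn_mentions: int,
    max_calcium: float,
    intercept: float = 0.0,  # hypothetical: the intercept is not reported
) -> float:
    """Weighted sum of the six features, using coefficients from the text."""
    return (
        intercept
        + 1.33 * (n_enc_3plus_meds > 1)
        + 0.95 * (mean_sbp > 128.6)
        - 0.52 * sd_enc_2_meds  # continuous; its scaling is an assumption
        + 0.49 * (median_enc_4plus_meds_per_year >= 1.25)
        + 0.42 * (n_htn_mentions > 40)
        + 0.40 * (max_calcium > 10.1)
    )


def logistic(score: float) -> float:
    """Map a linear score to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical patient meeting all five dichotomized criteria.
score = atrh_linear_score(5, 150.0, 0.0, 2.0, 60, 10.5)
probability = logistic(score)
```

Because five of the six terms are boolean, each contributes either nothing or a fixed increment, which is what makes the model easy to trace by hand.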
The input features are shown on the left followed by the learned thresholds, the multiplication coefficients, and the summation. Note that the subsequent logit transformation and interpretive threshold are not depicted.
None of the other derived models can be described in such compact, clear language. So, to compare and contrast FEAT with other methods, we calculated SHAP values38 for the test subjects. SHAP values summarize the impact of input variables on model outputs by generating an additive feature attribution model. Positive and negative SHAP values indicate an increase and decrease in predictions, respectively. The summary plots of SHAP values (Fig. 6A, 6C) depict the distribution of SHAP values relative to the magnitude of each input variable, with each dot representing a single test subject. The decision plots of SHAP values (Fig. 6B, D) illustrate how each feature contributes to predictions for several individual subjects.
SHAP summary (A) and decision plots (B) for LR L1 and the summary (C) and decision (D) plots for FEAT. The summary plots (A,C) indicate the most important features, ranked by the mean absolute SHAP value for test data. The decision plots (B,D) show a sample of 10 positive and 10 negative point predictions by the models, with dash-dotted lines indicating misclassifications. The labels in the summary and decision plots identify which feature is responsible for the incremental change in the model score at each level.
The FEAT summary plot (Fig. 6C) reflects the simplicity of the FEAT model. For the five dichotomized features, each patient’s prediction is either increased or decreased by a fixed increment. The one continuous feature affects each patient distinctly, but the effect has a clear directionality, i.e. high variability in the number of encounters on two anti-hypertensive medications decreases the prediction. These simple effects translate into clear, largely self-explanatory interpretations for individual subjects as to “why” the model is calling them positive or negative (Fig. 6D). The positive-slope increases in output show that most patients predicted to be positive have 2 or more encounters on 3 anti-hypertensive medications. They also either have high mean systolic blood pressure and more than 40 mentions of hypertension in notes or multiple encounters per year while prescribed 4 or more anti-hypertensive medications.
In contrast, the LR L1 (Fig. 6A,B) and RF (Supp. Fig. 2) summary and decision plots reflect much more complicated models, in which many features confer small contributions to the prediction scores. The summary plots show the modest effect of each of the 20 features with the highest model coefficients (LR L1) or mean absolute SHAP value. The decision plots demonstrate that each patient has a distinct “reason” for a positive or negative prediction, determined by a combination of many features. In addition, there is also considerable signal from the features not depicted, as evident in the variable intercepts between each patient and the bottom model output value x-axis. Notably, for the LR L1 model many of the features depicted (e.g. minimum HDL cholesterol) are not intuitively linked to the phenotype, likely due to feature co-linearity. To address this, we also calculated LR L1 SHAP values after adjusting for feature covariance (Supp. Fig 3). After adjustment, the top features (e.g. # enc 4+ meds, median) now match clinical intuition. That said, the resulting plots still depict a more complicated relationship between features and SHAP values and the persistence of a large number of features with small individual effects. As a result, we cannot simply explain for most subjects “why” the LR L1 or RF models are predicting them as positive or negative. For the sake of comparison, similarly accounting for co-linearity in the FEAT model reinforces the explainability of its individual subject predictions (Supp. Fig. 3C,D).
Of note, the FEAT models’ interpretability does have costs. For instance, some patients were classified as positive by the model but excluded by chart review because of heart failure or chronic kidney disease (Fig. 6D). In contrast, the LR L1 model appears to learn to lower predictions based on maximum creatinine or heart failure diagnosis codes (Supp. Fig. 3A). Such features were considered in FEAT training and were included in 4 of the 10 training iterations’ final models, but these models were not selected by our algorithm because of their overall higher complexity.
DISCUSSION & CONCLUSION
We developed a computational method to automate the construction of EHR computable phenotypes and applied that method to find patients that should be screened for the under-diagnosed, under-treated disease primary aldosteronism. Conventional approaches to manually build accurate computable phenotypes cannot scale to the expanse of potential clinical use cases. However, by embedding the design goals for such heuristics into ML approaches, it may be possible to automate their development. The expert design of computable phenotypes applies clinical knowledge in an intuitive manner. Our goal in applying FEAT to automatically create such heuristics is to generate a reasonable symbolic model that is highly accurate and interpretable by clinical practitioners.
We have compared FEAT’s ability to learn computable phenotypes to that of expert heuristic curation and standard ML approaches. The models generated by FEAT are more concise and interpretable than other ML approaches that achieve similar levels of accuracy (i.e. LR and random forests). The FEAT models matched the discriminative performance of other models across the varied tasks, except for the random forest model of the most complicated phenotype, aTRH. In this case, the FEAT model performed less well than RF yet was completely interpretable.
In comparison to expert-curated heuristics, the FEAT models showed better discrimination for two phenotypes but slightly worse discrimination for aTRH. This underperformance for aTRH was not unexpected for several reasons. First, the FEAT method was not empowered to learn temporal relationships between features that enabled the expert heuristic to achieve specificity, such as including a minimum time interval between meeting anti-hypertension medication criteria and assessment for persistently elevated blood pressure. We expect that future improvements to the feature representation learning method may enable the approach to natively identify such temporal relationships from longitudinal EHR data. Second, the comparison between FEAT and the expert heuristic was biased because the heuristic was used to identify most of the affected test subjects, likely inflating its observed performance. Even beyond classification performance, we believe that FEAT-generated models are more adaptable to changing data than expert-curated heuristics.
The model that FEAT learned to identify patients with aTRH was both accurate and understandable. Its components matched those of the expert heuristic and are consistent with clinical intuition. The model demonstrated the power of combining complementary sources of information, including medication, vitals, laboratories, and concepts from notes. Finally, it learned an unexpected but clinically intuitive and valuable rule, maximum blood calcium > 10.1 mg/dL. Anti-hypertensive medications, particularly diuretics, can dysregulate calcium homeostasis. We suspect this rule enabled the model to identify a few affected subjects on intensive anti-hypertensive regimens that were missed by the conventional rules interrogating medication and blood pressure.
There are several possible directions for further improving FEAT. For one, the ability of FEAT to recapitulate expert-curated heuristics suggests that simpler expert heuristics, such as anchor variables49, may be leveraged as teachers in a semi-supervised approach. This could be implemented with multi-stage learning, first to predict heuristics and then to predict chart-review. Or, expert heuristics could be encoded as syntax trees and used to seed initial runs of FEAT. To improve the aTRH phenotype, FEAT transformations should include temporal reasoning. Another limitation of this work is the non-trivial, manual feature engineering upstream of FEAT. Future work could peel back this manual feature engineering by enabling FEAT to learn from longitudinal data. Although the search space would considerably increase, there is more opportunity to learn temporal relationships. And as we apply this tool to less engineered input features and harder problems, the search space will be very large. The approach would benefit considerably from learning on top of a framework that encodes expert clinical knowledge, such as ontologies and knowledge graphs50. The incorporation of expert knowledge would improve search efficiency and potentially performance, while maintaining interpretability.
In summary, FEAT can effectively learn highly accurate and interpretable computable phenotypes. Further refinements to the learning framework and process should eventually allow experts to review automated computable phenotypes, rather than manually design them. We believe such tools will enable widespread implementation of computable phenotype-triggered clinical decision support.
Data Availability
Public data used for benchmarking FEAT is available from www.github.com/EpistasisLab/pmlb. Due to privacy concerns, data used for clinical phenotyping cannot be released under the terms of the IRB approval.
Competing Interests
None declared
Funding
This work was supported by Grant 2019084 from the Doris Duke Charitable Foundation and the University of Pennsylvania. W. La Cava was supported by NIH grant K99 LM012926. J.H. Moore and W. La Cava were supported by NIH grant R01 LM010098.
Acknowledgments
We would like to thank Debbie Cohen for helpful discussions about caring for patients with secondary hypertension.