Abstract
Discovery of comorbidities (the concomitant occurrence of distinct medical conditions in the same patient) is a prerequisite for creating forecasting tools for downstream outcomes research. Current comorbidity discovery applications are designed for small datasets and use stratification to control for confounding variables such as age, sex, or ancestry. Stratification lowers false positive rates, but reduces power, as the size of the study cohort is decreased. Here, we describe a Poisson Binomial based approach to comorbidity discovery (PBC) designed for big-data applications that circumvents the need for stratification. PBC adjusts for confounding demographic variables on a per-patient basis, and models temporal relationships. We benchmark PBC using two datasets, the publicly available MIMIC-IV; and the entire Electronic Health Record (EHR) corpus of the University of Utah Hospital System, encompassing over 1.6 million patients, to compute comorbidity statistics on 4,623,841 pairs of potentially comorbid medical terms. The results of this computation are provided as a searchable web resource. Compared to current methods, the PBC approach reduces false positive associations, while retaining statistical power to discover true comorbidities.
Introduction
Comorbidity refers to the concomitant occurrence of distinct medical conditions in the same patient1. Comorbidities can occur together, or sequentially across the patient’s medical history. Exploring these temporal connections offers additional insight into disease progression and disease associations, and promises improved predictive tools for evidence-based medicine2–5. Traditionally, comorbidities have been discovered manually, through human chart review, literature search, and clinical knowledge6. For example, the authors of the Charlson comorbidity index7 selected comorbid diagnoses based on manual chart review for a 559-patient cohort. Likewise, the well-known Elixhauser Comorbidity index, was compiled through review of published studies identifying comorbid conditions8.
Large collections of Electronic Health Records (EHRs) present promising new opportunities for comorbidity discovery. However, manual review of millions of EHR records in search of comorbidities is infeasible, and ab initio means for discovery and temporal ordering of comorbidities using large collections of EHRs is an area ripe for innovation. Current computational approaches to ab initio comorbidity discovery use statistics such as risk ratio, odds ratio, comorbidity-score9, propensity score or ϕ-correlation to measure effect size. P-values are obtained using Fisher’s exact test (or hypergeometric), χ² test, or binomial test10–14. All of these approaches rely on an assumption that each member of the population has a disease probability equal to the population incidence rate. Confounding variables such as age and sex are controlled for by sub-setting the data, a process termed stratification, or alternatively through use of matched case-control cohorts14, 15. Both of these approaches control for confounders, but at the expense of statistical power, because they necessarily reduce sample size. One approach to overcoming this intrinsic limitation is to aggregate massive collections of EHRs, but as we show, even millions of records are too few to explore comorbidities associated with rare diseases when controlling for multiple confounding variables.
PBC models the effects of confounding variables, allowing every sample to be personalized, resulting in improved statistical power compared to stratification. Briefly, PBC uses logistic regression to model how each patient’s demographics impact his or her probability of having a medical term. These personalized probabilities are then used to calculate pairwise expectations and p-values under the Poisson binomial distribution, rather than the hypergeometric and binomial distributions that are used for Fisher’s exact test, and the χ² test, respectively. Moreover, with minor modification, this approach can also be used to temporally order comorbidities and to determine the significance of directionalities. As we demonstrate, use of the Poisson binomial is a significant advantage, because it obviates the need for stratification. The result is increased power for discovery, which we leverage to explore the relationships among diagnoses, medical procedures and medications.
Alongside the need for improved statistical methods, tools are also needed to browse, search, and visualize the network of comorbid medical terms discovered in big-data applications. In a manner similar to Siggaard et al.14, we provide a browser-based query engine for navigating these comorbidities and their temporal relationships within the University of Utah Hospitals system [https://pbc.genetics.utah.edu/lemmon2021/pbc-utah].
In what follows, we describe PBC, explore its behavior, and benchmark its performance using the contents of the University of Utah Health system, and the publicly available MIMIC-IV dataset16. For brevity’s sake, we will refer to co-occurring medical diagnoses, procedures and medications using the single blanket term, comorbidity. We demonstrate how PBC can be used to transform massive EHR datasets into a temporal dependency graph for large-scale ab initio discovery of comorbid relationships and investigations of disease progressions.
Results
Modeling the effects of confounders using logistic regression
We collected records for 1.6 million patients, encompassing 50 million visits and 150 million diagnosis (DX), procedure (PX) and medication (RX) codes from the University of Utah Electronic Data Warehouse (EDW). For the proof of principle analyses presented here, diagnoses were converted from ICD917 and ICD1018 diagnosis codes to Clinical Classification System19 (CCS) multi-level diagnosis codes. CPT20 provider billing codes were converted to CCS multi-level procedure codes. We include both leaf nodes and internal nodes so that the researcher can discover more specific comorbidities (e.g “CCS 2.1.1: cancer of colon”) as well as more general comorbidities (e.g. “CCS 2.1: colorectal cancer” or “CCS 2: neoplasms”). Medications were coded using RxNorm concept unique identifiers (CUIs)21. This procedure reduced the number of distinct medical terms to 1007 DX, 259 PX and 1775 RX codes. We also collected demographic information, including sex, race, ethnicity, insurance class, age and length of medical records.
A logistic regression model (LRM) was determined for each DX, PX and RX term. The LRM includes demographic information for each patient (age, gender, ancestry, ethnicity, insurance type) and EHR exposure (the length and density of a person’s medical record). Since medical terms are included or excluded from year to year, and coding practices vary over time, we also include the date of the patient’s last visit as a control for this effect. The complete list of features is described in Table S1 and Figure S1. The response variable is whether each patient has the term in their medical record. Note that we do not model recurrence - in this analysis we consider only the first instance of a term in a patient’s medical history. We use these LRMs to estimate for each medical term, each patient’s a priori personalized probability of having that term in their medical record.
We used a regularized regression model under the assumption of collinearity in the confounding features. However, there is no requirement for such an assumption; for example, Neural networks could be used in place of LRM, so as to better capture nonlinear relationships between variables. L1 and L2 penalized logistic regression include a value C that prevents overfitting by penalizing large coefficients. Smaller C-values specify stronger regularization (e.g., stronger prevention of over-fitting). To determine the optimal C-value for each LRM, it was necessary to choose a score function for LRM evaluation. We experimented with a number of standard and custom score functions as described in supplemental “Math.pdf”. We optimize C for each LRM using stratified 3-fold cross validation. Grid search is used to evaluate C-values in the set {{10-14, 10-13, 10-12, …, 1012, 1013, 1014}. Figure S2 top panel shows boxplots of the scores reported by each of these score functions. We use entropy as a measure of the ability of a score function to differentiate model quality. From those score functions achieving high entropy, we evaluate the distribution of C-values (Figure S2 bottom panel). We choose Jcutoff for all downstream analysis because it includes fewer outliers than the other methods examined. Jcutoff is based on Youden’s J statistic22, only rather than a 50% probability threshold, the classification threshold is determined empirically so that the total number of predicted positives is equal to the actual count of positives.
We train LRMs using stratified 3-fold cross validation, evaluated using Jcutoff. Under L1 penalized logistic regression, model features can be unselected by setting their coefficients to zero. Figure 1 summarizes how often each demographic feature is included in the trained LRMs. Age at last visit, number of visits, number of terms, and length of medical record were grouped together as “EHR exposure”. Patients may be seen at a non-university clinic, may move in or out of the state over the years, or may have differing proclivities toward visiting the doctor. EHR exposure is an attempt to control for these effects. As Figure 1 makes clear, EHR exposure is always important in predicting whether a patient has a particular medical term.
PBC retains power as features are added and as sample sizes are reduced
The binomial distribution models the discrete probability of the number of successes in N independent experiments each with probability P. A naïve approach to comorbidity analysis assumes the probability of seeing term 1 and term 2 (P𝓉1,𝓉2) in a medical record is the product of the population incidence rates for terms 1 and 2. Knowing the number of patients with both terms 1 and 2 in their medical record, one can calculate a comorbidity p-value using the binomial test of statistical significance. However, because this method does not adjust for demographic factors, the p-values they generate can be driven by effects such as age and sex. Thus, a common approach in comorbidity literature is to stratify the population by age and sex, and then calculate the binomial p-value for each stratum7. In contrast, the PBC approach uses the Poisson binomial distribution to calculate p-values for each term pair. The Poisson binomial distribution is a generalization of the binomial distribution in which every trial/sample (i.e., patient) has a different probability of “success” (i.e., having both terms in their medical record). The probability of an individual patient having both terms in their medical record is calculated as the product of per-patient per-term probabilities generated from corresponding LRMs described above.
Figure 2 explores the relationships between comorbidities discovered by PBC versus stratification as a function of increasing numbers of demographic features. As can be seen, the stratification approach rapidly loses power as more criteria are added to the stratum filter. In contrast, by modeling demographic features, PBC maintains power.
Table 1 compares the power of stratification and PBC to detect three well known comorbidities using our EHR corpus. As in Figure 2, we compare the p-values generated by sequentially adding confounding variables. Notice how stratification dramatically lowers the strength of p-values as a function of the size of the stratum. This same behavior is illustrated globally in Figure 2. Even with millions of EHRs, controlling for more than a few confounding demographic variables leads to strata that are too small to achieve statistical significance. By modeling the effects of multiple confounding variables, PBC retains statistical power to identify comorbidities.
Table S3 presents five well known comorbidities of breast cancer23–28. Stratification by age and gender deflates the strength of all p-values, and as a result they fall below the Bonferroni corrected significance threshold (Table S3 column “binomial female 50-59”). In contrast, by explicitly modeling age, gender, race, ethnicity, insurance type, and EHR exposure, PBC retains power to capture these true positive associations.
The complete set of comorbid term pairs discovered by PBC can be visualized using network analysis. In supplemental Figure S3, we use the minimum description length algorithm29 to perform clustering of pairwise comorbidities by p-value strength. Terms with similar patterns of comorbidities are closer together in the network. To illustrate this, we annotated selected comorbidities discovered within each cluster. A literature search confirmed that these labeled comorbidities represent existing clinical knowledge (see corresponding citations in Table S7). Figure S3 provided the motivation to produce an interactive tool for querying, exploring and extracting information from the comorbidity network. In order to provide better means to navigate this complex network we developed a browser-based tool for exploration of comorbidities, discussed below.
PBC identifies comorbidities unique to underrepresented minority groups
PBC can also capture true comorbidities hidden within mixed populations. For instance, consider Sickle Cell Anemia, a disease that affects 1 in 365 African American newborns and 1 in 100,000 newborn Caucasians in the United States30. Malaise and fatigue, while common to many disorders, are among the most common symptoms of Sickle Cell Anemia (SCA)31–33, and usually manifest in an age dependent manner. Table 2 compares five comorbidity p-values. Row 1 presents p-values calculated using all data. While the true comorbidity is discovered, we cannot say with certainty if the relationship is a true comorbidity or simply driven by a third confounding variable (e.g. ancestry). The following rows present X2 p-values after stratifying by ancestry, ethnicity, gender, and age and PBC p-values after including these same features in the regression model. Stratification fails to find a significant comorbidity between SCA and maliase/fatigue once the data is partitioned by ancestry. The rarity of Sickle Cell disease in Caucasians and the small sample size for African Americans within Utah’s EHR corpus make detection of this comorbidity difficult. PBC, in contrast, discovers the comorbidity. In fact, the additions of ancestry, ethnicity, and age each increase the strength of the association between SCA and malaise/fatigue.
Application of PBC to a publicly available dataset
Because the University of Utah data used in this study cannot be shared publicly, as it includes protected health information (PHI), we also demonstrated the general applicability of PBC by applying it on the publicly available MIMIC-IV dataset16. This dataset includes 248,714 patients with associated ICD10 diagnosis or procedure codes. A total of 5,363,338 ICD9 and ICD10 diagnosis and procedure codes were converted to multi-level CCS codes. These include 725 distinct CCS diagnosis codes and 395 distinct CCS procedure codes. We repeated our experiments on this dataset, training a logistic regression model for each diagnosis and procedure code, and calculating comorbidities using the Poisson binomial distribution. Although the MIMIC-IV data is missing much of the detail available in the University of Utah dataset, the PBC results generally mirror those of the Utah dataset (Figure S4). Regression features are shown in Figure S4, top panel. In addition, because a random offset is added to each patient’s admission dates, it is not possible to control for the changes in use of various billing codes over time. Despite these limitations, the deployment of PBC on the MIMIC-IV public dataset further illustrates how PBC retains statistical power to identify comorbid relationships that are lost by stratification. Tables 1 and 2 are replicated on MIMIC-IV data as supplemental tables S4 and S5. Co-occurrence and directional comorbidities discovered within MIMIC-IV data can be queried at the following link: https://pbc.genetics.utah.edu/lemmon2021/pbc-mimic.
Temporalized P-values allow for understanding of disease progression
The PBC approach can be extended to provide temporalized (or directional) p-values across pre-specified time windows (see Methods for details). The inclusion of a direction window is necessary for several reasons. On short time scales, the order of appearance of diagnostic codes is an unreliable indicator of which condition actually preceded the other in the patient. The development of underlying disease, the relevant signs and symptom, the provider arriving at a given diagnosis, and the eventual recording of the said diagnosis might follow staggered paths that have little or no relevance when viewing the data in a time-slice of less than 30 days or so, thus for many analyses it is probably best to treat these events as contemporaneous. However, in some cases a short window size is optimal for capturing a comorbidity.
Consider the following example. PBC reports the following p-values for the diagnosis/procedure pair amputation of lower extremity → postoperative infection: within 30 days = 1e-3663, greater than 30 days = 1e-24, greater than 90 days = 1e-6, greater than 365 days = 0.86. It is clear that shorter window sizes better capture the increased risk of infection after amputation, as one would logically expect. Thus, the window size must be informed by clinical knowledge and research objectives. In this manuscript, we examined associations based on a 90-day window. Using our website (see below) a user can also query additional window sizes (30 days, 365 days and 730 days).
Table S6 presents specific directional associations discovered in an ab initio fashion by PBC and supported by clinical knowledge. For instance, Milrinone is prescribed to patients awaiting heart transplant34. The tendency of type 2 diabetics to develop chronic kidney disease is well known35, 36. HIV induced immunocompromisation often leads to pneumocystis37 and obesity is a known risk factor for hypertension38.
A web-based resource for comorbidity research
Among 4,623,841 pairs of medical terms in our collection of 1.6 million EHRs, we identified 3,311,830 comorbidities co-occurring within a 90-day window, and 1,969,941 temporally directed ones, acting over a time period greater than 90 days. All associations meet a Bonferroni significance threshold of 1.08e-08. The result is a highly-connected network of comorbid diagnoses and associated procedures and medications based on the University of Utah EHR database.
In response to the size and complexity of these ‘big-data’, we have created browser-based means to navigate, query and explore them (https://pbc.genetics.utah.edu/lemmon2021/pbc-utah). A screenshot highlighting the functionality of the browser is shown in Figure 3. The site allows the user to search for relationships between any pairwise diagnosis, procedure or medication. The result is a searchable table of all other DX, PX, and RX codes along with statistics about the connection to the query term. These statistics include the counts, expectation and p-value of association after adjusting for confounders shown in Figure 1. We use two-sided p-values, so that “less than expected” associations are also discoverable. For comparison, we provide both χ2 p-values and G-test p-values.
In addition to patient lifetime co-occurrence p-values, we provide within window, and out of window directional p-values with a selectable window size (30, 90, 365, or 730 day). We also provide statistics on effect size, including relative risk, odds ratio, and “flow rate” which is the percent of patients coded with term 1 who later are coded with term 2. The “Flowchart” view puts the query term in focus and shows the terms that tend to precede and follow the query term to the left and right respectively. The user can filter by p-value strength and by flow rate.
A slice of that network is shown in Figure 3. This figure contains a “flow chart” view for essential hypertension. By switching out the central node, an investigator can step through the temporalized network of diagnoses, procedures and medications. The investigator can further filter by effect size to find co-occurrences that are both significant and prevalent.
Comparison with other published comorbidity tools
Table 3 compares functionalities provided on our comorbidity website with those found in other published comorbidity discovery tools. To the best of our knowledge, our website is the only available public resource of its kind that considers the individual risk profile for each patient having each medical term. Additionally, our approach (and website) captures inverse comorbidities (terms occurring together less often than expected) and models temporal relations between pairs of terms.
The R package “comoRbidity” was published April 201810. We installed the package, and reformatted our data to fit the required specifications. Given our input of 150,598,377 CCS and RxNorm codes, the “comoRbidity” package consumes all available RAM (we used a Linux server with 504 GB of RAM) and fails to complete. We tried to acquire CytoCom11 and comoR12, however the corresponding author has indicated that these projects are no longer maintained nor available for download. The Java package “Comorbidity4j” was published January 201913. We installed Comorbidity4j and attempted to compute on our full dataset. The application warned against calculating comorbidities for more than 300,000 pairs of terms (774 distinct terms). After several hours the calculation times out - having allocated 50.2 GB of RAM. In contrast our method uses a maximum of 174 MB of RAM and has no upper limit on the number of patients, visits, or unique terms.
The Disease Trajectory Browser (DTB) allows navigation of temporal relationships among medical records of 7.2 million Danish patients14. Direct comparison between our results is not possible since our groups have access to different EHR datasets, however we can consider differences in methodology. DTB measures effect size using relative risk, significance is measured using the binomial test, and confounders are controlled for using case/control matching. For comparison, on PBC-Web, we provide relative risk estimates and χ2 p-values (which are a close approximation to the binomial test p-values). As described above, our approach models patient features rather than controlling for them through stratification or case/control matching.
Discussion
Although the term comorbidity is often used to denote significant associations between medical outcomes, (e.g., hypertension and heart attack), the concept is easily extended to include associated variables, such as medical procedures and medications. For brevity’s sake, in what follows, we refer to statistically significant associations among these collective variables as comorbidities.
Comorbidity discovery is a feature discovery/selection process, and it is important to distinguish it from outcomes prediction. Before one can understand and predict a medical outcome, one must first decide which prior diagnoses, medical procedures and medications are germane to the outcome of interest. Discovery of comorbidities is thus a prerequisite for downstream outcomes research and for creating forecasting tools39–42, but is logically separate from them.
Big data offer many opportunities and challenges for comorbidity discovery. One limitation imposed by data size is that morbidity discovery is necessarily pairwise; hence the term comorbidity, as opposed to multimorbidity discovery. Tools for multimorbidity-based discovery43 are necessarily limited in scale due to computational constraints, considering for instance, 34 disease clusters44. In contrast, we have calculated pairwise comorbidities among 37,997 ICD10 diagnosis codes.
Commonly used statistical approaches to ab initio comorbidity discovery are hindered by the assumption that every member of the population (or stratum) has a disease probability equal to the population (or stratum) incidence rate. As we have explained, stratification is commonly used to subset EHR collections to meet this requirement; but stratification necessarily reduces sample size and statistical power (c.f. Figure 2). In practice, stratified data quickly become limiting, even for very large datasets, as inclusion criteria grow more complex. This problem is exacerbated for ab initio approaches aimed at simultaneous discovery of comorbid relationships among thousands of diagnoses, procedures and medications. This process necessitates many millions of statistical tests; the requirement for multiple testing corrections mean that statistical power is of paramount importance.
Our motivation in developing PBC was to overcome the need for stratification, while still achieving high accuracy and statistical power, so as to allow discovery of comorbidities of rare diseases, using small datasets, and ab initio discovery using very large EHR collections. Our results document the efficacy of PBC for achieving these ends. Still, it is important to bear in mind, that the comorbidities it discovers do not necessarily indicate mechanistic relationships. For instance, two diagnoses may both be driven by smoking, but since smoking was not included logistic regression model, we cannot say anything about smoking as a potential driver (cause) of the relationship. Thus PBC, like all existing methods in this domain, cannot with certainty assign comorbidities to one of the four etiological models described by Valderas et al1; it can only say that the relationship is not due to factors included in the logistic regression model.
Our hope is that PBC will provide an effective solution for a foundational step for outcomes research. The curse of dimensionality is a well-known phenomenon in which training a predictor with too many features can lead to higher error rates45, 46. Considering there are about 69,000 ICD10 diagnosis codes, 70,000 ICD10 procedure codes, and 350,000 RxNorm CUI codes, dimensionality reduction is necessary for effective machine learning on EHRs. By discovering which variables influence which outcome, PBC can reduce dimensionality and facilitate the creation of downstream tools for outcome predictions. Thus, PBC’s role in feature selection becomes clear. For a given clinical outcome, PBC can produce a manageable set of pairwise associations which become the inputs for training predictive models of disease.
It is important to note the limitations inherent in the use of data from a single EHR for comorbidity discovery. Data from a single EHR represents a non-random sampling of the general population and PBC does not model this sampling bias. Billing practices can vary within hospital systems - for instance, between inpatient and outpatient services and between provider billing and hospital billing. Hospital billing is performed by medical billing specialists, whereas provider billing is performed by clinicians. In this paper, we have restricted our analysis to provider billing terms. Clinical notes provide a still richer, more nuanced source of data and may more accurately describe a patient’s medical condition. Clearly application of PBC to the outputs of Natural Language Processing (NLP) tools will be a fruitful path for future research.
Conclusion
Capobianco and Lio47 present a vision for comorbidity discovery and analysis that is multi-disciplinary and enabled by dynamic networks, with time as a key component in explaining disease relationships. We share this vision, and our PBC method directly addresses the challenges for creating a scalable network-based approach that can (1) dynamically adjust for confounding demographic variables and (2) model temporal relationships in large, complex EHR datasets. However, comorbidities do not exist as isolated pairs, rather they combine in a conditionally dependent manner to create a complex web of influence on any given outcome. While PBC is powered to discover that web by identifying the major drivers of a particular outcome, determining the joint contributions of conditionally dependent variables on that outcome requires a separate computational machinery. Bayesian networks48–49, for example can be used to compute the joint contributions of multiple conditionally dependent variables (so-called multimorbid calculations), providing fully explainable patient outcome predictions.
Obtaining a global overview of comorbidity and disease progressions across a major research hospital network is as difficult as it is desirable. We offer the PBC web-browser as a first-generation navigation tool for this new domain of EHR database visualization. Our hope is that the PBC website will provide a community resource for outcomes research, laying the foundation for improving current comorbidity-based outcomes tools, creation of new ones, and, more generally, fueling healthcare discovery for improved care.
Methods
University of Utah medical records
The University of Utah maintains an Electronic Data Warehouse (EDW) – a central storage and search facility for all data collected from all university hospitals and clinics, and all departments and specialties. SQL queries were composed to the following information: (1) medical record number, sex, race, ethnicity, and age for each patient; (2) list of patient visits, along with visit date, and medical terms associated with each visit, including diagnostic codes, procedure codes, and medications ordered. Data were deidentified.
We collect ICD917 and ICD1018 diagnosis codes CPT procedural codes20 and RXNorm21 medication codes (“concept unique identifiers”) from University of Utah electronic medical records. ICD and CPT diagnosis and procedure codes were mapped to the hierarchical Clinical Classification System (CCS)19. CCS codes allow for more powerful statistics at the expense of concept resolution. After mapping to CCS, we retain 1007 distinct diagnosis codes and 259 distinct procedure codes. These codes include both internal nodes and leaf nodes in the hierarchical CCS tree. In all, we collected records for 1.6 million patients, 50 million visits and 150 million diagnosis (DX), procedure (PX) and medication (RX) codes.
Counts of EDW patient demographics are displayed in Table S1. Figure S1 displays how our data is distributed by gender and age decade. Figure S1, panel B shows how the length of patient medical records (in years) is distributed. Note that these lengths are limited by the history of electronic data collection at the University of Utah that began in the early 2000s but started to ramp up around 2009 and has since increased rapidly. Thus, we see the 95th percentile for medical history length is around 12-15 years for most age bins. Figure S1 panels C and D show how the number of visits and the number of terms in a medical record trend with age. In almost all decades, women have more medical visits and medical terms than men, though this effect is most pronounced between 20 and 50.
Logistic regression for person-term probabilities
The initial step in comorbidity analysis is to ascertain the probability of a given term being found in a given person’s medical record. A naïve approach could assign everyone the same probability based on the term’s frequency in the database or within each age-gender strata as seen in other methods.
Our approach involves developing a logistic regression model for each term. The independent features, X, are the list of persons in the EHR along with their gender, race, ethnicity, financial class and risk exposure. Risk exposure includes the age of a person at the time of their last visit as well as the length and density of their medical record. Length is defined as the number of days between first visit and last visit, while density is approximated by the number of visits within a medical record. The dependent outcome, y, is a binary vector indicating whether each person has the term in their medical record. The value C is the inverse of regularization strength in L2 penalized logistic regression. Smaller values of C indicate stronger regularization. The coefficients are determined by minimizing the following loss function, where β represents the coefficients and c is a constant (see Scikit-learn documentation50 for a more complete discussion of logistic regression): For each term, stratified 3-fold cross-validation is used to determine the optimal value for C within the set {10-14, 10-13, 10-12, …, 1012, 1013, 1014}. Cross-validation relies on a scoring function to assess the accuracy of logistic regression, given differing values for C. We evaluated standard and custom score functions based on their ability to differentiate logistic regression results with differing C values (see Figure S2).
The above approach resulted in logistic regression models for each term, capable of predicting the probability that a given person has a given term. For rare terms, we find that the probabilities output by logistic regression may not sum up to the actual number of patients with the term. To account for this we adjust each probability by a bias correction factor such that the sum of the adjusted patient probabilities is equal to the actual number of patients with the term. The exact form of this correction factor is given in the supplemental “Math.pdf”.
A limitation of logistic regression is the lack of a confidence metric on each predicted probability output by the regression model. For instance, predicted probabilities might be more accurate for patients of western European than African ancestry, because the corpus data is skewed for this demographic variable (see Table S1). To overcome this limitation, we divide our data into 6 randomized partitions, balanced so that the number of affected individuals in each partition is approximately equal. This partitioning is accomplished using “StratifiedKFold” from the python package sklearn46. Next each partition is used to fit a logistic regression model, using the previously determined regularization strength. Each model is used to predict the probability of each person having the term. These 6 probabilities are used to calculate a sample variance for each person-term probability, , where 𝓂 represents a patient’s medical record, 𝓉 represents a medical term, and P𝓉∈𝓂 is the probability of term 𝓉 in 𝓂. To be clear these 6 LRMs are only used to calculate sample variance, while the LRM trained on the full dataset is used per-patient per-term probability predictions.
Poisson Binomial for term pair p-values
Our null hypothesis is that pairs of medical terms are independently distributed in University of Utah medical records, H0: 𝓉1 ⊥ 𝓉2|ℳ. A significant p-value would indicate that a pair of terms co-occur more often than would occur by chance. Given two independent terms, 𝓉1 ⊥ 𝓉2 the probability the two would occur by chance in a given person’s medical record 𝓂, is the product of their individual probabilities: A naïve approach assumes person-term probabilities are equal to population incidence rates: Using the naïve approach, person-term-pair probabilities follow the binomial distribution with However, using logistic regression, we have different probabilities for each person/term pair. The Poisson binomial distribution is the discrete distribution of a sum of Bernoulli trials where the probability of each trial differs. Thus, using logistic regression, our data follows a Poisson binomial distribution.
Because the cumulative distribution function (CDF) of a Poisson binomial is computationally tractable only for a small number of values, numerous approximations have been developed52. We use the normal approximation because it is fast and accurate for large datasets. To determine a p-value using the normal approximation, the mean and variance for the Poisson binomial are calculated and used as parameters for a normal distribution. The mean of a Poisson binomial represents the expected number of University of Utah patients who will have in their medical record both terms in the pair and is calculated as the sum of probabilities for each person 𝓂: The variance of a Poisson binomial is likewise similar in form to a binomial distribution: The Poisson binomial variance is augmented with the logistic regression variances described in the previous section using the product rule and the law of total variance. One can think of these variances as measurement error for P𝓉1⊆𝓂 and P𝓉2⊆𝓂 and they are larger for rare terms.
The probability that a person has a pair of terms, P{𝓉1,𝓉2}⊆𝓂, can be so rare it exceeds the limits of floating-point arithmetic. Thus, we implement our methods in log-space. Our logistic regression models report per-patient per-term log probabilities. We calculate ln(μ𝓉1,𝓉2) and rather than summing in normal space. We use a numerical approximation to calculate the ln(pvalue) of a normal distribution53 (implemented as “gsl_sf_log_erfc” in the Gnu scientific library54). To report significance, we use throughout this paper an alpha threshold of 0.05. Since we calculate p-values for 4,622,320 pairs of medical terms, our Bonferroni corrected alpha is set to 1.08e-08.
Direction P-values
Given two terms that occur together in medical records more often than would occur by chance, which term tends to occur first in the medical record, or do they tend to occur in the same time frame? We calculate p-values for the temporal nature of each association. For each patient with medical record m, the date of the first occurrence of each term in m is recorded. Pairs of terms occurring within a window of size W are labeled as “in-window”. Pairs of terms occurring outside of W contribute to the 𝓉1 → 𝓉2 count or the 𝓉2 → 𝓉1 count. For the analyses presented here, we chose a 90-day window, as this duration decreases noise associated with the date of information capture within the medical record, but the approach is valid over any interval. Note that some term relationships – such as a surgery procedure followed by an infection diagnosis – will only show a significant direction with a window smaller than 90 days. PBC-Web includes 4 window sizes (30, 90, 365, 730).
For a person with medical record 𝓂, containing terms 1 and 2 ({𝓉1, 𝓉2} ⊆ 𝓂), the probability that term 1 occurs before term 2 is a function of the ratio of the probabilities of the 2 terms:
Given {𝓉1, 𝓉2} ⊆ 𝓂, the probability 𝓉1 and 𝓉2 occur within a time window of size W, is a function of the span or length of a person’s medical history, Spanm:
The above formula represents the percent of timepoints t1 and t2 that fall within W days of each other. The derivation of the formula is given in the supplemental “Math.pdf”.
The product of (7) and (8) gives the probability term 1 would precede term 2 within a window of size W: Similar logic can be applied to derive the “out-of-window” probability of term 1 occurring at least W days before term 2. A more complete explanation is found in supplemental “Math.pdf”. Direction p-values are calculated using a normal approximation of the Poisson Binomial CDF as in the previous section.
Code, web-development, and calculations
We implement our statistical analysis using Python, Cython and C. Cython is a static compiler for Python and the extended Cython programming language55. Scikit-learn51 was used for Logistic regression studies. All the figures accompanying this article were generated using Matplotlib56. Our website is built using the Flask web framework57. The backend is pure python and the front end is JavaScript and D358. Logistic regression modeling and pairwise calculation of Poisson Binomial p-values were performed at the University of Utah Center for High Performance Computing (CHPC) PHI protected environment. Training 3041 logistic regression models while tuning the regularization strength with cross validation took 1959 CPU hours. Maximum memory usage was 174 MB. Calculating comorbidity statistics for 4,623,841 term pairs took 7455 CPU hours while maximum memory usage was 87 MB.
Data Availability
In this paper we calculate comorbidity statistics for all pairs of medical billing codes - including diagnoses, procedures, and medications. All of these p-values are available to query and download from the following link: https://pbc.genetics.utah.edu/lemmon2021
Author Contributions
Gordon Lemmon is the senior research associate leading PBC development and validation. Sergiusz Wesolowski is an applied mathematician who has helped formalize our approach to statistical testing. Alex Henrie was a software engineer on the project. Martin Tristani-Firouzi and Mark Yandell conceived of the project and secured research funding and played a key role in scientific discussions regarding development of PBC. All authors edited the manuscript.
Competing interests
GL, MY own shares in Backdrop Health, a University of Utah effort to commercialize Bayesian inference on health records. However, there are no financial ties regarding this research.
Data Availability
In this paper we calculate comorbidity statistics for all pairs of medical billing codes – including diagnoses, procedures, and medications. All of these p-values are available to query and download from the following link: https://pbc.genetics.utah.edu/lemmon2021.
Code Availability
We provide a CodeOcean capsule with code and data; the link is submitted by the editor to the reviewers during the peer-review process.
Supplemental Display Items
Acknowledgements
The following collaborators have provided valuable discussion, feedback, and insight which has guided development of PBC: Bruce Bray, Vikrant Deshmukh, Karen Eilbeck, Edgar Javier Hernandez, Rashmee Shah. We thank members of the University of Utah EDW for facilitating access to medical records. The computational resources used were partially funded by the NIH Shared Instrumentation Grant 1S10OD021644-01A1.
This research was supported by the AHA Children’s Strategically Focused Research Network grant (17SFRN33630041) and the Nora Eccles Treadwell Foundation. Gordon Lemmon was supported by NRSA training grant T32H757632. Sergiusz Wesolowski was supported by NRSA training grant T32DK110966-04 and the AHA Children’s Strategically Focused Research Network Fellowship award (17SFRN33630041).