Classification of 12-lead ECGs: the PhysioNet/Computing in Cardiology Challenge 2020

The subject of the PhysioNet/Computing in Cardiology Challenge 2020 was the identification of cardiac abnormalities in 12-lead electrocardiogram (ECG) recordings. A total of 66,405 recordings were sourced from hospital systems from four distinct countries and annotated with clinical diagnoses, including 43,101 annotated recordings that were posted publicly. For this Challenge, we asked participants to design working, open-source algorithms for identifying cardiac abnormalities in 12-lead ECG recordings. This Challenge provided several innovations. First, we sourced data from multiple institutions from around the world with different demographics, allowing us to assess the generalizability of the algorithms. Second, we required participants to submit both their trained models and the code for reproducing their trained models from the training data, which aids the generalizability and reproducibility of the algorithms. Third, we proposed a novel evaluation metric that considers different misclassification errors for different cardiac abnormalities, reflecting the clinical reality that some diagnoses have similar outcomes and varying risks. Over 200 teams submitted 850 algorithms (432 of which successfully ran) during the unofficial and official phases of the Challenge, representing a diversity of approaches from both academia and industry for identifying cardiac abnormalities. The official phase of the Challenge is ongoing.


Introduction
Cardiovascular disease is the leading cause of death worldwide (Benjamin et al 2019). Early treatment can prevent serious cardiac events, and the most important tool for screening and diagnosing cardiac electrical abnormalities is the electrocardiogram (ECG) (Kligfield et al 2007, Kligfield 2002). The ECG is a non-invasive representation of the electrical activity of the heart that is measured using electrodes placed on the torso. The standard 12-lead ECG is widely used to diagnose a variety of cardiac arrhythmias, such as atrial fibrillation, and other cardiac anatomy abnormalities, such as ventricular hypertrophy (Kligfield et al 2007). ECG abnormalities have also been identified as both short- and long-term mortality risk predictors (Mozos and Caraba 2015, Gibbs et al 2019). Therefore, the early and correct diagnosis of cardiac ECG abnormalities can increase the chances of successful treatment. However, manual interpretation of ECGs is time-consuming and requires skilled personnel with a high degree of training.
The automatic detection and classification of cardiac abnormalities can assist physicians in making diagnoses for a growing number of recorded ECGs. However, there has been limited success in achieving this goal (Willems et al 1991, Shah and Rubin 2007). Over the last decade, the rapid development of machine learning techniques has produced a growing number of 12-lead ECG classifiers (Ye et al 2010, Ribeiro et al 2020, Chen et al 2020). Many of these algorithms may identify cardiac abnormalities correctly. However, most of these methods are trained, tested, or developed on single, small, or relatively homogeneous datasets. In addition, most methods focus on identifying a small number of cardiac arrhythmias that do not represent the complexity and difficulty of ECG interpretation.
The PhysioNet/Computing in Cardiology Challenge 2020 provided an opportunity to address these problems by providing data from a wide set of sources with a large set of cardiac abnormalities (Goldberger et al 2000, PhysioNet Challenges 2020, PhysioNet/Computing in Cardiology Challenge 2020a). The PhysioNet Challenge is an initiative that invites participants from academia, industry, and elsewhere to tackle clinically important questions that are either unsolved or not well-solved. As in previous years, the Challenge had both an unofficial phase and an official phase that ran over the course of several months. PhysioNet co-hosts the Challenge annually in cooperation with the Computing in Cardiology conference. The goal of the 2020 PhysioNet Challenge was to identify clinical diagnoses from 12-lead ECG recordings. We asked participants to design and implement a working, open-source algorithm that can, based only on the clinical data provided, automatically identify any cardiac abnormalities present in a 12-lead ECG recording. As in previous years, we facilitated the development of the algorithms through the Challenge but did little to constrain the algorithms themselves. However, we required that each algorithm be reproducible from the provided training data. The winner of the Challenge was the team whose algorithm achieved the highest score for recordings in the hidden test set. Traditional scoring metrics, such as common area under the curve (AUC) metrics, do not explicitly reflect the clinical reality that some misdiagnoses are more harmful than others and should be scored accordingly. We therefore developed a new scoring function that awards partial credit to misdiagnoses that result in similar treatments or outcomes as the true diagnosis or diagnoses, as judged by our cardiologists.

Data
For the PhysioNet/Computing in Cardiology Challenge 2020, we assembled multiple databases from across the world. Each database contained recordings with diagnoses and demographic data.

Challenge data sources
We used data from five different sources. Two sources were split to form training, validation, and test sets; two sources were included only as training data; and one source was included only as test data. These sources of ECG data are described below and summarized in table 1. We made the training data and clinical ECG diagnoses (labels) publicly available, but the validation and test data were kept hidden. The training, validation and test data were matched as closely as possible for age, sex and diagnosis. The completely hidden dataset has never been posted publicly, allowing us to assess common machine learning problems such as overfitting.
(a) CPSC.
(d) G12EC. The fourth source is the Georgia 12-lead ECG Challenge (G12EC) Database, Emory University, Atlanta, Georgia, USA. This is a new database, representing a large population from the Southeastern United States, and is split between the training, validation, and test sets. The validation and test sets comprise the hidden G12EC set.
(e) Undisclosed. The fifth source is a dataset from an undisclosed American institution that is geographically distinct from the other dataset sources. This dataset has never been (and may never be) posted publicly, and is used as a test set for the Challenge.

Challenge data variables
Each 12-lead ECG recording was acquired in a hospital or clinical setting. The specifics of the data acquisition depend on the source of the databases, which were assembled around the world and therefore vary. We encourage readers to consult the original publications for details but provide a summary below. Each annotated ECG recording contained 12-lead ECG signal data with sampling frequencies varying from 257 Hz to 1 kHz. Demographic information (age and sex) and a diagnosis or diagnoses, i.e. the labels for the Challenge data, were also included. The quality of the labels depended on the clinical or research practices, and the Challenge included labels that were machine-generated, over-read by a single cardiologist, and adjudicated by multiple cardiologists. Table 2 provides a summary of the age, sex, and recording information for the Challenge databases, indicating differences between the populations. Table 3 and figure 1 provide summaries of the diagnoses for the training and validation data. The training data contain 111 diagnoses or classes. We used 27 of these 111 diagnoses to evaluate participant algorithms because they were relatively common, of clinical interest, and more likely to be recognizable from ECG recordings. Table 3 lists the scored diagnoses for the Challenge with long-form descriptions, the corresponding Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) codes, and abbreviations. Only these scored classes are shown in table 3 and figure 1, but all 111 classes were included in the training data so that participants could decide whether or not to use them with their algorithms. The test data contain a subset of the 111 diagnoses in potentially different proportions, but each diagnosis in the test data was represented in the training data.
All data were provided in MATLAB- and WFDB-compatible format (Goldberger et al 2000). Each ECG recording had a binary MATLAB v4 file for the ECG signal data and an associated text file in WFDB header format describing the recording and patient attributes, including the diagnosis or diagnoses, i.e. the labels for the recording. We did not change the original data or labels from the databases, except (1) to provide consistent and Health Insurance Portability and Accountability Act (HIPAA)-compliant identifiers for age and sex, (2) to add approximate SNOMED-CT codes as the diagnoses for the recordings, and (3) to change the amplitude resolution to save the data as integers as required for WFDB format. Saving the signals as integers helped reduce storage size and compute times without degrading the signal, as it only represents a change in the scaling factor for the signal amplitude.
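To illustrate the header layout and why integer storage amounts only to a change of scaling factor, the following minimal sketch parses a made-up WFDB-style header and converts stored integers back to physical units. The header text, the `#Age`, `#Sex` and `#Dx` comment keys, and the helper names `parse_header` and `to_physical` are assumptions for illustration, and the parser assumes the standard WFDB signal-line field order; it is not the Challenge's own reader, and real data are better read with a WFDB library.

```python
import re

# Made-up two-lead header in WFDB style (real Challenge headers describe 12 leads).
HEADER = """A0001 2 500 7500
A0001.mat 16+24 1000/mV 16 0 28 -1716 0 I
A0001.mat 16+24 1000/mV 16 0 7 2029 0 II
#Age: 74
#Sex: Male
#Dx: 426783006
"""


def parse_header(text):
    """Parse the record line, the signal lines, and '#Key: value' comment lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    record, n_sig, fs, n_samples = lines[0].split()[:4]
    info = {"record": record, "fs": float(fs), "n_samples": int(n_samples),
            "leads": [], "gains": [], "baselines": [], "comments": {}}
    for ln in lines[1:1 + int(n_sig)]:
        fields = ln.split()
        # fields[2] has the form gain(baseline)/units; baseline and units are optional.
        m = re.match(r"([0-9.]+)(?:\((-?\d+)\))?(?:/(\S+))?$", fields[2])
        adc_zero = int(fields[4])
        info["gains"].append(float(m.group(1)))
        # When no baseline is given, the ADC zero value is used instead.
        baseline = int(m.group(2)) if m.group(2) is not None else adc_zero
        info["baselines"].append(baseline)
        info["leads"].append(fields[-1])
    for ln in lines[1 + int(n_sig):]:
        if ln.startswith("#"):
            key, _, value = ln[1:].partition(":")
            info["comments"][key.strip()] = value.strip()
    return info


def to_physical(samples, gain, baseline):
    """Convert stored integers to physical units: (digital - baseline) / gain."""
    return [(s - baseline) / gain for s in samples]
```

With a gain of 1000 per mV, a stored value of 28 corresponds to 0.028 mV, so the integer representation loses nothing beyond the chosen amplitude resolution.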

Challenge objective
We asked participants to design working, open-source algorithms for identifying cardiac abnormalities in 12-lead ECG recordings. To the best of our knowledge, for the first time in any public competition, we required that teams submit code both for their trained models and for training their models, which aided the generalizability and reproducibility of the research conducted during the Challenge. We ran the participants' trained models on the hidden validation and test data and evaluated their performance using a novel, expert-based evaluation metric that we designed for this year's Challenge.
Table 2. Number of recordings, mean duration of recordings, mean age of patients in recordings, sex of patients in recordings, and sample frequency of recordings for each dataset. Italicized dataset names indicate that the database is a subset of the source dataset above it. The training, validation and test data were matched as closely as possible for age, sex and diagnosis.

Challenge overview, rules, and expectations
This year's Challenge is the 21st PhysioNet/Computing in Cardiology Challenge (Goldberger et al 2000). Similar to previous Challenges, this year's Challenge had an unofficial phase and an official phase. The unofficial phase (February 7, 2020 to April 30, 2020) provided an opportunity to socialize the Challenge and seek discussion and feedback from teams about the data, evaluation metrics, and requirements. The unofficial phase allowed five scored entries for each team. After a short break, the official phase (May 11, 2020 to August 23, 2020) introduced additional training, validation, and test data; a requirement for teams to submit their training code; and an improved evaluation metric. The official phase allowed 10 scored entries for each team. During both phases, teams were evaluated on a small validation set; evaluation on the test set occurred after the end of the official phase of the Challenge to prevent sequential training on the test data. Moreover, while teams were encouraged to ask questions, pose concerns, and discuss the Challenge in a public forum, they were prohibited from discussing their particular approaches to preserve the uniqueness of their approaches for solving the problem posed by the Challenge.
[Figure 1. Labels recovered from the figure. Datasets: CPSC (6877), CPSC-Extra (3453), St. Petersburg (74), PTB (490), PTB-XL (21837), G12ECG (10344), Validation (6630). Diagnoses: IAVB (2946), AF (4026), AFL (423), Brady (289), CRBBB (701), IRBBB (1817), LAnFB (1916), LAD (6564), LBBB (1197), LQRSV (748).]

Classification of 12-lead ECGs
We required teams to submit both their trained models and the code for training their models. We announced this requirement at the launch of this year's Challenge but did not start requiring the submission of training code until the official phase of the Challenge; by this time, we had a better idea of what teams would need to train their algorithms. Teams included any processed and relabeled training data in the training step; any changes to the training data are part of training a model. We first ran each team's training code on the training data and then ran the resulting trained model on the hidden validation and test sets. We ran each algorithm sequentially on the recordings to use them as realistically as possible.
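The two-step procedure described above (train on the public data, then apply the trained model to hidden recordings one at a time) can be sketched as follows. This is a toy illustration only: `train_model`, `run_model` and `evaluate` are hypothetical names, and the "model" simply remembers the most frequent training label; none of this is the Challenge's actual submission interface.

```python
from collections import Counter


def train_model(training_records):
    """Hypothetical training step: remember the most common training label."""
    counts = Counter(label for _, labels in training_records for label in labels)
    return {"default_label": counts.most_common(1)[0][0]}


def run_model(model, signal):
    """Hypothetical inference step for a single recording (the signal is ignored here)."""
    return {model["default_label"]}


def evaluate(training_records, hidden_signals):
    """Step 1: train from the training data. Step 2: classify recordings sequentially."""
    model = train_model(training_records)
    outputs = []
    for signal in hidden_signals:  # one recording at a time, never as a batch
        outputs.append(run_model(model, signal))
    return outputs


# Made-up (signal, labels) pairs standing in for the training set.
TRAINING = [([0.1, 0.2], {"AF"}), ([0.0, 0.1], {"AF", "LBBB"}), ([0.3, 0.1], {"NSR"})]
```

Because the model is rebuilt from the training data inside `evaluate`, any preprocessing or relabeling has to live in the training step, which mirrors the reproducibility requirement stated above.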
We allowed teams to submit either MATLAB or Python implementations of their code. Other languages, including Julia and R, were supported but received insufficient interest from participants during the unofficial phase. Participants containerized their code in Docker and submitted it in GitHub or GitLab repositories. We downloaded their code and ran it in containerized environments on Google Cloud. The computational environment is described more fully in Reyna et al (2019), which describes the previous year's Challenge.
We used virtual machines on Google Cloud with 8 vCPUs, 64 GB RAM, and an optional NVIDIA T4 Tensor Core graphics processing unit (GPU) with a 72 hour time limit for training on the training set. We used virtual machines on Google Cloud with 2 vCPUs, 13 GB RAM, and an optional NVIDIA T4 Tensor Core GPU with a 24 hour time limit for running the trained classifiers on the test set.
To aid teams, we shared baseline models that we implemented in MATLAB and Python. The Python baseline model was a random forest classifier that used age, sex, QRS amplitude, and RR intervals as features. QRS detection was implemented using the Pan-Tompkins algorithm (Pan and Tompkins 1985). The MATLAB baseline model was a hierarchical multinomial logistic regression classifier that used age, sex, and global electrical heterogeneity (Waks et al 2016) parameters as features. The global electrical heterogeneity parameters were computed using a time coherent median beat and origin point calculation (Perez-Alday et al 2019). The QRS detection and RR interval calculations were implemented using the heart rate variability (HRV) cardiovascular research toolbox (Vest et al 2018, Vest et al 2019). However, the aim of these example models was not to provide a competitive classifier but to show how to read the recordings and extract features from them.
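As a rough illustration of the kind of RR interval feature mentioned above, the sketch below summarizes RR intervals given R-peak positions that are assumed to have been detected already (for example by a Pan-Tompkins detector, which is not implemented here). The function name `rr_features` and the chosen summary statistics are assumptions; the baseline models' actual feature code may differ.

```python
from statistics import mean, pstdev


def rr_features(r_peak_samples, fs):
    """Summarize RR intervals (in seconds) from R-peak sample indices.

    r_peak_samples: increasing sample indices of detected R peaks (assumed given).
    fs: sampling frequency in Hz.
    """
    if len(r_peak_samples) < 2:
        return None  # at least two beats are needed for one RR interval
    rr = [(b - a) / fs for a, b in zip(r_peak_samples, r_peak_samples[1:])]
    mean_rr = mean(rr)
    return {
        "mean_rr": mean_rr,               # average beat-to-beat interval
        "sd_rr": pstdev(rr),              # spread of the intervals
        "mean_hr_bpm": 60.0 / mean_rr,    # corresponding mean heart rate
    }
```

Features like these would then be combined with age and sex into a feature vector for a conventional classifier.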

Evaluation of classifiers
For this year's Challenge, we developed a new scoring metric that awards partial credit to misdiagnoses that result in similar outcomes or treatments as the true diagnoses as judged by our cardiologists. This scoring metric reflects the clinical reality that some misdiagnoses are more harmful than others and should be scored accordingly. Moreover, it reflects the fact that it is less harmful to confuse some classes than others because the responses may be similar or the same.
Let $C = \{c_i\}_{i=1}^{m}$ be a collection of $m$ distinct diagnoses for a database of $n$ recordings. First, we defined a multi-class confusion matrix $A = [a_{ij}]$, where $a_{ij}$ is the normalized number of recordings in a database that were classified as belonging to class $c_i$ but actually belong to class $c_j$ (where $c_i$ and $c_j$ may be the same class or different classes). Since each recording can have multiple labels and each classifier can produce multiple outputs for a recording, we normalized the contribution of each recording to the scoring metric by dividing by the number of classes with a positive label and/or classifier output. Specifically, for each recording $k = 1, \dots, n$, let $x_k$ be the set of positive labels and $y_k$ be the set of positive classifier outputs for recording $k$. We defined a multi-class confusion matrix $A = [a_{ij}]$ by
$$a_{ij} = \sum_{k=1}^{n} a_{ijk},$$
where
$$a_{ijk} = \begin{cases} \dfrac{1}{|x_k \cup y_k|}, & \text{if } c_i \in y_k \text{ and } c_j \in x_k, \\ 0, & \text{otherwise.} \end{cases}$$
The quantity $|x_k \cup y_k|$ is the number of distinct classes with a positive label and/or classifier output for recording $k$. To incentivize teams to develop multi-class classifiers, we allowed classifiers to receive slightly more credit from recordings with multiple labels than from those with a single label, but each additional positive label or classifier output may reduce the available credit for that recording.
Next, we defined a reward matrix $W = [w_{ij}]$, where $w_{ij}$ is the reward for a positive classifier output for class $c_i$ with a positive label $c_j$ (where $c_i$ and $c_j$ may be the same class or different classes). The entries of $W$ are defined by our cardiologists based on the similarity of treatments or differences in risks (see figure 2). The highest values of the reward matrix are along its diagonal, associating full credit with correct classifier outputs, partial credit with incorrect classifier outputs, and no credit for labels and classifier outputs that are not captured in the weight matrix. Also, three pairs of similar classes (i.e. PAC and SVPB, PVC and VPB, CRBBB and RBBB) are scored as if they were the same class, so a positive label or classifier output in one of these classes is considered to be a positive label or classifier output for all of them. However, we did not change the labels in the training or test data to make these classes identical to preserve any institutional preferences or other information in the data. Finally, we defined a score for each classifier as a weighted sum of the entries in the confusion matrix,
$$s = \sum_{i=1}^{m} \sum_{j=1}^{m} w_{ij} a_{ij}.$$
This score is a generalized version of the traditional accuracy metric that awards full credit to correct outputs and no credit to incorrect outputs.
Figure 2. Reward matrix $W$ [...] table 3. Off-diagonal entries that are equal to 1 indicate similar diagnoses that are scored as if they were the same diagnosis. Each entry in the table was rounded to the first decimal place due to space constraints in this manuscript, but the shading of each entry reflects the actual value of the entry.
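One way to treat the three pairs of equivalent classes as identical at scoring time, without altering the stored labels, is to map each class to a representative before comparing labels and outputs. The sketch below assumes this approach; the choice of representative within each pair and the names `CANONICAL` and `canonicalize` are illustrative assumptions, not necessarily how the Challenge scoring code handles it.

```python
# Representative class for each equivalent pair (the choice of representative is arbitrary).
CANONICAL = {"SVPB": "PAC", "VPB": "PVC", "RBBB": "CRBBB"}


def canonicalize(classes):
    """Map equivalent classes to one representative, leaving other classes unchanged."""
    return {CANONICAL.get(c, c) for c in classes}
```

Applying `canonicalize` to both the label set and the output set of a recording means that, for example, an SVPB output against a PAC label is scored as a correct output, while the original annotations in the data files remain untouched.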
To aid interpretability, we normalized this score so that a classifier that always outputs the true class or classes receives a score of 1 and an inactive classifier that always outputs the normal class receives a score of 0, i.e.
$$s_{\text{normalized}} = \frac{s - s_{\text{inactive}}}{s_{\text{true}} - s_{\text{inactive}}},$$
where $s$ is the unnormalized score of the classifier, $s_{\text{inactive}}$ is the score for the inactive classifier and $s_{\text{true}}$ is the score for the ground-truth classifier. A classifier that returns only positive outputs will typically receive a negative score, i.e. a lower score than a classifier that returns only negative outputs, which reflects the harm of false alarms. Accordingly, this scoring metric was designed to award full credit to correct diagnoses and partial credit to misdiagnoses with similar risks or outcomes as the true diagnosis. The resources, populations, practices, and preferences of an institution all determine the ideal choice of the reward matrix W; the choice of W for the Challenge is just one example.

Results
We received a total of 1395 submissions of algorithms from 217 teams across academia and industry. The total number of successful entries was 707, with 397 successful entries during the unofficial phase of the Challenge and 310 successful entries during the official phase. During the official phase, we scored each entry on the validation set. The final score and ranking were based on the test set. A total of 70 teams' codebases successfully ran on the test data. After final scoring, 41 teams were able to qualify for the final rankings (PhysioNet/Computing in Cardiology Challenge 2020b). Reasons for disqualification included: the training algorithm did not run, the trained model failed to run on the hidden undisclosed set (because of differences in sampling frequencies), the team failed to submit a preprint on time, the team failed to attend Computing in Cardiology (remotely or in person) and defend their work, and the team failed to submit their final article on time or address the reviewers' comments. Figure 3 shows the performance of each team's final algorithm on the validation set, the hidden CPSC set, the hidden G12EC set, the hidden undisclosed set, and the test set. The line colors from red to blue indicate higher to lower scores on the test set. Scores differed between the sets. The highest scores were observed on the hidden CPSC dataset, which contained a larger number of recordings in the training set as compared to the other three hidden datasets. We can also observe a drop in scores for the hidden undisclosed set, for which no recordings were included in the training or validation sets.
Figure 4 shows the ranked performance of each team's final algorithm on the validation set, the hidden CPSC set, the hidden G12EC set, the hidden undisclosed set, and the test set. The points indicate the rank of each individual algorithm on each dataset. The line colors indicate the ranks on the test set.
Figure 4. Ranks of the final 70 algorithms that were completely evaluated on the validation set, the hidden CPSC set, the hidden G12EC set, the hidden undisclosed set, and the test set. Lines from top to bottom indicate the rank of each individual algorithm on each dataset. Rank is indicated by color coding, with red indicating the best ranked algorithms, blue indicating the worst ranked algorithms on the test set, and gray indicating disqualified algorithms.
On average, the Challenge scores dropped 47% from the hidden CPSC set to the hidden G12EC set and another 57% from the hidden G12EC set to the hidden undisclosed set. We observed an average drop of 50% from the validation set to the test set.
The most common algorithmic approach was based on deep learning and convolutional neural networks (CNNs). However, over 70% of entries used standard clinical or hand-crafted features with classifiers such as support vector machines, gradient boosting, random forests, and shallow neural networks. The median training time was 6 h 49 min; nearly all approaches that required more than a few hours for training used deep learning frameworks. Figures 3 and 4 show how the performance of participant entries dropped on the hidden sets. This under-performance on the hidden undisclosed dataset and, to a much lesser extent, on the hidden G12EC dataset could be because most teams over-trained on the CPSC data. The hidden CPSC data included fewer recordings than the other hidden sets. The poorer scores and ranks demonstrate the importance of including multiple sources for the generalizability of the algorithms.

Discussion
Deep learning approaches are among the most popular machine learning techniques for classification problems, especially those involving images. Some participants adapted algorithms that were previously developed for other classification problems; such adaptations do not necessarily perform better than a custom-made machine learning algorithm.
It is important to note the class imbalance between the datasets, but the larger number and varying prevalences of diagnoses in different datasets represent the real-world problem of reading 12-lead ECGs in a clinical setting. In fact, most teams performed best on the CPSC dataset, which was the least representative dataset because it had fewer and more balanced diagnoses than the other datasets. Moreover, the scoring function that we proposed and used to evaluate the performance of each algorithm penalized classes non-uniformly, based on clinical importance. Balancing data would not only be artificial, but would provide an advantage to teams because the prevalence of the class would then be known. The Challenge was designed to discourage the use of a priori information on distributions, since the algorithms are likely to be used in a variety of unknown populations. Moreover, racial inequities and genetic variations are likely to lead to substantially different performances. While we cannot address that directly because the populations in the databases are not strictly matched, there is the potential to evaluate long-standing unknowns in algorithms that have been traditionally developed on predominately white, western hemisphere populations. (We note that the training, validation, and test data were matched as closely as possible for age, sex and diagnosis.) In future Challenges, we will re-use these databases and reveal per-class performances in the hidden test data to allow full evaluations of the algorithms in terms of class, age, race, and gender.

Conclusions
This article describes the world's largest open access database of 12-lead ECGs, together with a large hidden test database to provide objective comparisons. The data were drawn from three continents with diverse and distinctly different populations, encompassing 111 diagnoses with 27 diagnoses of special interest for the Challenge. Additionally, we introduced a novel scoring matrix that rewards algorithms based on similarities between diagnostic outcomes, weighted by severity/risk.
The public training data and sequestered validation and test data provided the opportunity for unbiased, comparable, and repeatable research. Notably, to the best of our knowledge, this is the first public competition that has required the teams to provide both their original source code and the framework for (re)training their code. Doing so creates the first truly repeatable and generalizable body of work on the classification of electrocardiograms.