Abstract
The electrocardiogram (ECG) is an almost universally accessible diagnostic tool for heart disease. An ECG is measured by using an electrocardiograph, and today’s electrocardiographs use built-in software to interpret the ECGs automatically after they are recorded. However, these algorithms show limited performance, and therefore clinicians usually have to manually interpret the ECG, regardless of whether an algorithm has interpreted the ECG or not. Manual interpretation of the ECG can be time-consuming and require specific skills. Therefore, a better algorithm is clearly needed to make correct ECG interpretations more accessible and time efficient. Algorithms based on artificial intelligence have shown promising performance in many fields, including ECG interpretation, over the last few years and might represent an alternative to manual ECG interpretation.
In this study, we used a dataset with 88253 12-lead ECGs from multiple databases, annotated with SNOMED-CT codes by medical experts. We employed a supervised convolutional neural network with an Inception architecture to classify 30 of the most frequently annotated diagnoses in the dataset. Each patient could have more than one diagnosis, which makes this a multi-label classification task. We compared the Inception model’s performance when applying different preprocessing methods to the ECGs and different model settings during 10-fold cross-validation. We compared the model’s classification performance using binary cross-entropy (BCE) loss and double soft F1 loss. Furthermore, we compared the classification performance when downsampling the original sampling rate of the input ECG. Finally, we trained 30 interpretable linear models to provide class activation maps that explain the relative importance of each sample in the ECG with respect to the 30 diagnoses considered in this study.
Due to the heavily imbalanced class distribution in our dataset, we placed the most emphasis on the F1 score when evaluating the performance of the models. Our results show that the best F1 score was obtained when the Inception model used double soft F1 as the loss function and ECGs downsampled to 75 Hz. This model achieved an F1 score of 0.420 ± 0.017, an accuracy of 0.954 ± 0.002, and an AUROC score of 0.832 ± 0.019. An aggregation of the generated saliency maps, obtained using Local Interpretable Model-Agnostic Explanations (LIME), showed that the Inception model paid the most attention to the limb leads and the augmented leads and less attention to the precordial leads.
One of the more significant contributions that emerge from this study is the use of aggregated saliency maps to obtain ECG lead importance for different diagnoses. In addition, we emphasized the relevance of evaluating different loss functions, and in this specific case, we found double soft F1 loss to be slightly better than BCE. Finally, we found it somewhat surprising that downsampling the ECG led to higher performance compared to the original 500Hz sampling rate. These findings contribute in several ways to our understanding of the artificial intelligence-based interpretation of ECGs, but further studies should be carried out to validate these findings in other datasets from other patient cohorts.
1 Introduction
Cardiovascular disease (CVD) is one of the leading causes of death worldwide. The World Health Organization estimates that 17.9 million people died from CVD in 2016, which represented 31% of all global deaths that year [1]. Early detection of patients at risk of CVD could potentially reduce the severity of the disease and the number of people who die from it.
Electrocardiography is a non-invasive and widely used method to record electrocardiograms (ECG), which has enabled clinicians to interpret, diagnose and prognosticate heart disease since the beginning of the 20th century [2]. The ECG is the result of a measurement of the electrical activity of the heart by recording the voltage potential from electrodes placed on the patient’s skin. Electrocardiography is generally easier to set up and more cost-effective compared to other diagnostic methods such as echocardiogram and magnetic resonance imaging of the heart. On the other hand, one of the challenges is that the ECG can be difficult to interpret correctly. Correct interpretation can be time-consuming and require a high degree of expertise [3].
In the 1950s it became possible to convert analog ECG signals to digital format, and this led to the development of digital interpretation algorithms in the 1960s [4]. Today, most modern, clinically used electrocardiographs are equipped with built-in interpretation software. The software interprets the ECG and prints interpretive texts that may indicate different pathologies. Studies show that the automatic interpretation algorithms have several limitations [4, 5]. Because of the errors made by these algorithms, doctors have to read over the ECGs to ensure that the diagnosis is correct. This is time-consuming for the doctors and leads to high interpretation variability. Thus, there is a need for a better ECG interpretation algorithm, since this may lead to less time-consuming interpretation for the doctors, less variability in the interpretation, and better diagnostic performance, which in turn may lead to earlier detection and treatment of patients with CVD.
In the past decades, several new important trends have converged and may potentially be ushering in a new age with significance to ECG interpretation. Firstly, ECGs are now increasingly stored in digital format, allowing computerized analysis of massive data sets. Secondly, personal sensors such as training monitors and smartwatches (e.g., Apple Watch, Withings Watch, Samsung Galaxy Watch) now include simple ECG recording abilities, further expanding access to ECGs and the range of people studied. Finally, artificial intelligence (AI) or more specifically deep learning (DL) has shown remarkable abilities in classifying signal data [6], and more specifically also ECG data [7, 8, 9, 10, 11, 12, 13].
Despite the good performance of DL-based ECG interpretation models, doctors are still responsible for the diagnosis, and such models should therefore be considered decision support tools. However, the complexity of DL models makes their decisions inaccessible to humans, which is often referred to as the black box phenomenon [14]. This has led to the development of another sub-field within AI, explainable AI (XAI) [15], which aims to make model decisions more human-interpretable. XAI methods such as Gradient-weighted Class Activation Mapping (Grad-CAM) [16], LIME [17] and SHAP [18] have already been used to obtain class activation maps showing which parts of the raw ECG waveform were most important for the DL model’s prediction. The majority of these studies have focused on explaining single-label classification models [19, 20, 21, 22], while only a few have explained multi-label classification models [23, 24]. In one study, the researchers discovered novel disease-specific ECG features in Phospholamban (PLN) mutation carriers [19]. For many other diseases, however, the DL model will likely rely on very subtle patterns and combinations of features from different leads. Even when these are highlighted and displayed to doctors in the class activation map, it might be hard for them to understand or verify the relationship between the features used by the DL model.
This study builds on the George Moody Challenge 2020 [25] and 2021 [26], where the objective was to perform multi-label classification of cardiovascular diagnoses using the raw ECG waveform. We contributed to the 2020 edition of the George Moody Challenge by combining convolutional neural networks and rule-based algorithms [27], and in the 2021 edition we used classifier chains based on convolutional neural networks [28]. In the current study, we compare Inception models trained using BCE and double soft F1-loss and show how the sampling frequency of the ECG records affects the classification performance. Furthermore, we use explainable AI techniques to investigate which of the 12 leads has the highest importance when classifying different diagnoses.
2 Methods and materials
2.1 Data
We used ECG data from seven different open-access databases [26, 25, 29, 30, 31, 32, 33], with a total of 88253 12-lead ECGs in waveform format. All ECGs with a recording length other than 10 seconds were excluded, and 81327 ECGs were used for further development and validation, as Figure 1 shows. Each ECG was stored in a .mat file and had a corresponding .hea file containing metadata such as the ECG recording length, sample frequency, and the patient’s age, gender and diagnosis. There was a total of 133 different expert-annotated diagnoses in the dataset, but in this study we chose to consider only 30 of them (the same 30 used in the George Moody PhysioNet Challenge 2021 [26]). The prevalence of each of these 30 diagnoses is shown in Table 1. Each patient could have more than one of the 30 diagnoses at the same time, which makes this a multi-label classification task with more than 3000 different combinations of diagnoses among the patients in the dataset.
2.2 Preprocessing
2.2.1 ECG processing
More than 85% of all ECGs in the development set were initially sampled at 500 Hz. All ECGs were resampled to the same sample frequency. In this study, we compared the model’s performance when the signal was downsampled from 500 to 400, 300, 200, 100, 75, 50 and 25Hz.
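The resampling step can be illustrated with a minimal sketch. It assumes plain linear interpolation on a single lead, written in standard-library Python; the function name `resample_linear` is hypothetical, and the resampling routine actually used in this study is not specified here. A production pipeline would more likely use a filtered resampler such as `scipy.signal.resample_poly`, which applies an anti-aliasing filter before reducing the rate.

```python
def resample_linear(signal, fs_in, fs_out):
    """Resample a 1-D signal from fs_in to fs_out (Hz) by linear interpolation.

    Illustrative only: no anti-aliasing filter is applied before downsampling.
    """
    duration = len(signal) / fs_in          # recording length in seconds
    n_out = int(round(duration * fs_out))   # number of samples after resampling
    resampled = []
    for i in range(n_out):
        pos = i * fs_in / fs_out            # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)   # clamp at the last input sample
        frac = pos - lo
        resampled.append((1 - frac) * signal[lo] + frac * signal[hi])
    return resampled


# Synthetic 10 s lead at 500 Hz (a ramp, so the result is easy to check by eye).
lead_500 = [float(i) for i in range(5000)]
lead_75 = resample_linear(lead_500, 500, 75)
print(len(lead_75))  # 750 samples for a 10 s recording at 75 Hz
```

At 75 Hz a 10-second, 12-lead recording shrinks from 5000 to 750 samples per lead, so a convolution kernel of fixed length covers a time window roughly 6.7 times longer than at 500 Hz.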
2.2.2 Label processing
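We encoded the 30 diagnoses considered in this study as a binary vector, such that each ECG recording had a corresponding 30-element array of ones and zeros. Because a patient can have several diagnoses at once, more than one element can be set to one (a multi-hot rather than a strictly one-hot encoding). A one means that the patient has the given diagnosis and a zero means that the patient does not have the diagnosis.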
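As a small illustration of this label encoding, the sketch below maps the diagnoses attached to one recording onto a fixed list of scored diagnoses. The function name `encode_labels` and the three abbreviations are hypothetical placeholders; in the study the list would hold the 30 scored SNOMED-CT codes.

```python
def encode_labels(record_codes, scored_codes):
    """Return a list of 0/1 values with one position per scored diagnosis."""
    present = set(record_codes)
    return [1 if code in present else 0 for code in scored_codes]


# Hypothetical subset of scored diagnoses (placeholders, not the real code list).
SCORED = ["AF", "SR", "IAVB"]

# A recording annotated with two scored diagnoses and one unscored diagnosis.
print(encode_labels(["AF", "IAVB", "UNSCORED"], SCORED))  # [1, 0, 1]
```

Diagnoses outside the scored list are simply ignored, so a recording that carries only unscored diagnoses is encoded as an all-zero vector.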
2.3 CNN architecture
We developed a CNN model inspired by the Inception architecture [34] as shown in Figure 2 using TensorFlow [35]. The input to this model was an array, representing the raw ECG. The array containing the ECG signal can be denoted as:
The output layer of the model had 30 neurons, corresponding to the 30 scored diagnoses. A sigmoid activation was used in the final layer, giving a continuous number between 0 and 1 for each of the 30 diagnoses.
2.4 Loss function
A loss function is used to compute the error of the predictions made by the model during the training phase. The computed errors are used to adjust the weight coefficients in the model using backpropagation [36]. A previous study claimed that different variations of soft F-loss could be beneficial when performing multi-label classification with imbalanced classes [37]. In this study, we compared the Inception model using double soft F1 loss and binary cross-entropy loss. Equation 2 shows how double soft F1-loss is calculated. A small number ($10^{-16}$) is added to the denominator to prevent division by zero. The gradients of the loss were computed by backpropagation, and an Adam optimizer used them to update the weights of the artificial neurons in the Inception model [38].
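To make the idea concrete, the following is a minimal sketch of a double soft F1 loss in standard-library Python. It follows the commonly published formulation, in which soft counts of true and false positives and negatives are built from the predicted probabilities and the costs for the positive and negative class are averaged. That this matches Equation 2 term by term is an assumption, and the function name `double_soft_f1_loss` is hypothetical. In practice the same arithmetic would be written with TensorFlow tensors so that it is differentiable.

```python
def double_soft_f1_loss(y_true, y_prob, eps=1e-16):
    """Macro-averaged double soft F1 loss.

    y_true: list of rows of 0/1 labels (one row per ECG, one column per class).
    y_prob: list of rows of sigmoid outputs in [0, 1], same shape as y_true.
    """
    n_classes = len(y_true[0])
    total = 0.0
    for c in range(n_classes):
        tp = fp = fn = tn = 0.0
        for t_row, p_row in zip(y_true, y_prob):
            t, p = t_row[c], p_row[c]
            tp += p * t                # soft true positives
            fp += p * (1 - t)          # soft false positives
            fn += (1 - p) * t          # soft false negatives
            tn += (1 - p) * (1 - t)    # soft true negatives
        f1_pos = 2 * tp / (2 * tp + fn + fp + eps)  # soft F1 for the positive class
        f1_neg = 2 * tn / (2 * tn + fn + fp + eps)  # soft F1 for the negative class
        total += 0.5 * ((1 - f1_pos) + (1 - f1_neg))
    return total / n_classes


labels = [[1, 0], [0, 1]]
print(double_soft_f1_loss(labels, [[1.0, 0.0], [0.0, 1.0]]))  # perfect predictions: 0.0
print(double_soft_f1_loss(labels, [[0.5, 0.5], [0.5, 0.5]]))  # uninformative predictions: 0.5
```

Unlike BCE, which sums a per-label error, this loss is computed from batch-level counts, so a rare class contributes as much to the loss as a frequent one.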
2.5 Training and validation
The model was trained and evaluated on the dataset using 10-fold stratified cross-validation. The data were stratified based on the prevalence of the diagnoses to ensure similar distribution of diagnoses in both the train and validation fold. The models were trained for 15 epochs, with a batch size of 30 and a learning rate of 0.001.
The model performance was scored using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, the F1-score, and the average accuracy across all classes (hereafter referred to simply as accuracy). Equation 3 shows how we compute accuracy: for each class c, the true label (y) is compared with the predicted label (ŷ) over all n_s ECG recordings to give a per-class accuracy, and these per-class accuracies are then averaged across all n_c classes.
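The accuracy described above, and the reason it can look favourable on imbalanced data, can be sketched as follows. The code is a standard-library Python illustration with hypothetical function names; the macro-averaged F1 used for comparison is an assumption, since the averaging scheme for the F1-score is not stated here.

```python
def mean_class_accuracy(y_true, y_pred):
    """Average over classes of the fraction of recordings labeled correctly."""
    n_s, n_c = len(y_true), len(y_true[0])
    per_class = []
    for c in range(n_c):
        correct = sum(1 for s in range(n_s) if y_true[s][c] == y_pred[s][c])
        per_class.append(correct / n_s)
    return sum(per_class) / n_c


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    n_s, n_c = len(y_true), len(y_true[0])
    scores = []
    for c in range(n_c):
        tp = sum(1 for s in range(n_s) if y_true[s][c] == 1 and y_pred[s][c] == 1)
        fp = sum(1 for s in range(n_s) if y_true[s][c] == 0 and y_pred[s][c] == 1)
        fn = sum(1 for s in range(n_s) if y_true[s][c] == 1 and y_pred[s][c] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_c


# Ten recordings, two classes: class 0 is rare (one positive), class 1 is balanced.
y_true = [[1, 1]] + [[0, 1]] * 4 + [[0, 0]] * 5
# A model that never predicts the rare class but gets class 1 right every time.
y_pred = [[0, 1]] * 5 + [[0, 0]] * 5

print(round(mean_class_accuracy(y_true, y_pred), 3))  # 0.95
print(round(macro_f1(y_true, y_pred), 3))             # 0.5
```

The model in this toy example misses every case of the rare diagnosis yet still reaches 95% accuracy, which is why the F1-score is the more informative metric for a dataset like this one.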
2.6 Explainability
To find the relative importance of the features in the ECGs, a Local Interpretable Model-Agnostic Explanations (LIME) model was trained to fit the input data (ECGs) to the output predictions from the Inception model. A LIME model is a linear surrogate model that is easier to interpret than a deep neural network. One LIME model was trained and tested for each of the 30 classes. As output, a LIME model provides a class activation map with the same shape as the input. Values close to zero in the class activation map mean low activation, while higher values mean higher activation. Figure 3 shows an example of one ECG lead (aVL), visualized in the same plot as the corresponding activation map for that specific lead. The LIME model is here trained to explain the atrial fibrillation classification from the Inception model. The dark red lines indicate high activation, while the brighter colors indicate lower activation. This example is what is called a local explanation, because it only explains the model’s behavior on a single input, whereas global explanations are used to explain the model’s behavior on the whole population.
2.6.1 Developing the LIME model
We trained a LIME model for each of the 30 classes using the training data from the 10th cross-validation split used during model development. Each LIME model was trained on 1000 ECGs labeled with the class to explain and 1000 ECGs labeled with classes other than the one to explain. The trained LIME models were then applied to the ECGs in the validation fold of the 10th cross-validation split. The n-th LIME model was applied to all ECGs labeled with the n-th class in the validation fold.
3 Results
3.1 Loss function
The box plots in Figure 4 compare the performance of the Inception model trained using BCE loss with the Inception model trained using double soft F1-loss. Each box represents the ten values achieved during the 10-fold cross-validation. Figure 4a shows the achieved accuracy, Figure 4b shows the F1-score and Figure 4c shows the AUROC score.
Due to the heavily imbalanced dataset used in this study, we selected the loss function that achieved the best F1-score, which was double soft F1-loss.
3.2 Sampling frequency
To assess the impact of the ECG sampling frequency on the classification results, we took eight copies of the original dataset and resampled them to eight different sample frequencies (25 Hz, 50 Hz, 75 Hz, 100 Hz, 200 Hz, 300 Hz, 400 Hz, 500 Hz). Eight Inception models, using double soft F1-loss, were trained and validated using 10-fold cross-validation. All eight models were trained for 15 epochs, with a batch size of 30, using Adam as the optimizer and an initial learning rate of 0.001. In Figure 5 we compare the cross-validated accuracy, F1-score and AUROC score obtained by the eight models.
3.3 Explainability
Finally, an Inception model with double soft F1-loss as the loss function was trained on ECG signals resampled to 75 Hz. The Inception model was trained on the training data from the 10th split of the 10-fold cross-validation and tested on the validation fold. One LIME model was trained and tested for each of the 30 diagnoses, giving 30 LIME models in total. Figure 6 shows a saliency map of the ECG leads with the highest activation/importance for each of the 30 diagnoses. The saliency map was obtained by finding the ECG lead with the highest average activation value for each of the 30 diagnoses in the 10th validation fold.
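The aggregation behind such a saliency map can be sketched as below, in standard-library Python. The sketch assumes that each class activation map is stored as a list of time samples, each holding 12 per-lead activation values, and that the leads follow the conventional order given in `LEADS`. Both the data layout and the function name `most_important_lead` are assumptions made for illustration.

```python
# Conventional 12-lead order (assumed; the stored order may differ).
LEADS = ["I", "II", "III", "aVR", "aVL", "aVF", "V1", "V2", "V3", "V4", "V5", "V6"]


def most_important_lead(activation_maps):
    """Return the lead with the highest mean activation and all per-lead means.

    activation_maps: list of maps for one diagnosis; each map is a list of
    time samples, and each sample is a list of 12 per-lead activations.
    """
    totals = [0.0] * len(LEADS)
    count = 0
    for amap in activation_maps:
        for sample in amap:
            for lead_idx, value in enumerate(sample):
                totals[lead_idx] += value
            count += 1
    means = [total / count for total in totals]
    best = max(range(len(LEADS)), key=lambda i: means[i])
    return LEADS[best], means


# Two tiny synthetic maps in which lead II (index 1) is the most activated.
map_a = [[0.1, 0.9] + [0.1] * 10 for _ in range(3)]
map_b = [[0.2, 0.5] + [0.2] * 10 for _ in range(2)]
lead, means = most_important_lead([map_a, map_b])
print(lead)  # II
```

Repeating this for the activation maps of each of the 30 diagnoses gives one top-ranked lead per diagnosis, which is the information a lead-importance diagram summarizes.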
4 Discussion
This study demonstrates an Inception-type convolutional neural network performing multi-label classification on an imbalanced ECG dataset. We also employed an explainable AI technique called LIME to find the ECG lead with the highest class activation for each of the 30 diagnoses considered in this study. To the best of our knowledge, this is the first time class activation maps have been used to determine ECG lead importance for different diagnoses.
During development of the Inception model, we compared two different loss functions, BCE and double soft F1-loss. We found that double soft F1-loss gave a higher F1-score (Figure 4b), which is considered the most important metric in a heavily imbalanced dataset such as the one used in this study. It is, however, somewhat surprising that the model using BCE loss achieved better accuracy and AUROC score than the model using double soft F1-loss. A plausible explanation is that the BCE model was good at classifying the majority classes, giving a high accuracy score, whereas the model using double soft F1-loss was generally good at classifying all 30 classes, giving a high F1-score.
One of the most surprising findings in this study was the improvement in classification performance when downsampling the original 500 Hz ECGs. As seen in Figure 5, both the accuracy (Figure 5a) and the F1-score (Figure 5b) peak at around 75 Hz. The increase in classification performance with downsampled ECG signals is counter-intuitive, since one would expect the downsampled ECG to lose important information. A possible explanation is that there is an ideal ratio between the convolution kernel size and the duration of ECG features such as P-waves, T-waves and QRS complexes. However, when we experimented with larger kernel sizes, this did not give the same improvement in classification performance as lowering the ECG sampling frequency. More research is needed to reach a conclusion.
The saliency map in Figure 6 is an aggregation of all class activation maps obtained from the 10th cross-validation fold. The figure shows that lead II is the lead with the highest overall activation across all diagnoses, and that the precordial leads (V1-V6) generally show low activation. A possible explanation is that lead II often has a high signal-to-noise ratio compared to the other leads. In addition, the majority of diagnoses considered in this study are arrhythmias, which human interpreters also generally diagnose by looking at the limb leads.
4.1 Augmentation
Augmentation has shown promising effects in various image classification tasks [39]. Therefore we hypothesized that augmentation might have a good effect on signal and ECG classification as well. More specifically, we tried to add random noise and baseline wander to the ECG signals. However, these augmentations did not significantly improve the performance of our models.
The random noise was introduced by adding a random number (N) in the range ±σ, where σ is the standard deviation of all values in the current ECG recording, as shown in Equation 4. Baseline wander was introduced by adding a cosine wave spanning 0 to 2π with a random phase shift between 0 and 2π. The amplitude of the cosine wave was set randomly by multiplying it by a random number (N) drawn from the distribution of all values in the ECG recording, as shown in Equation 5. Figure 7a shows an example of an unprocessed ECG and Figure 7b shows the same ECG with added random noise using the method described in Equation 4. Figure 7c shows the ECG in Figure 7a with simulated baseline wander as described in Equation 5.
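A minimal sketch of the two augmentations, as they are described in the text, is given below in standard-library Python for a single lead. Several details are assumptions rather than the exact form of Equations 4 and 5: the noise is drawn independently for each sample from a uniform distribution on [-σ, σ], the cosine wave completes exactly one period over the recording, and its amplitude is one value picked at random from the recording. The function names are hypothetical.

```python
import math
import random
import statistics


def add_random_noise(signal, rng):
    """Add uniform noise in [-sigma, sigma], where sigma is the signal's standard deviation."""
    sigma = statistics.pstdev(signal)
    return [x + rng.uniform(-sigma, sigma) for x in signal]


def add_baseline_wander(signal, rng):
    """Add one period of a cosine wave with random phase and random amplitude."""
    n = len(signal)
    phase = rng.uniform(0.0, 2.0 * math.pi)   # random shift between 0 and 2*pi
    amplitude = rng.choice(signal)            # drawn from the recording's own values
    return [
        x + amplitude * math.cos(2.0 * math.pi * i / n + phase)
        for i, x in enumerate(signal)
    ]


rng = random.Random(0)  # seeded so that the example is reproducible
# Synthetic stand-in for one 10 s lead at 75 Hz (not real ECG data).
lead = [math.sin(2.0 * math.pi * i / 50) for i in range(750)]
noisy = add_random_noise(lead, rng)
wandering = add_baseline_wander(lead, rng)
print(len(noisy), len(wandering))  # 750 750
```

Both functions leave the recording length unchanged, so augmented signals can be fed to the model in place of the originals during training.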
4.2 Limitation
One key limitation of this work is that we did not test the model on a separate and independent test set. The selection of the optimal loss function and sampling rate could therefore have resulted in overfitting to the present dataset. The datasets used to train and validate the model come from different hospitals across the world and are therefore likely to represent a great diversity of patients. Nevertheless, validation on an external test set is needed to check for overfitting that could have occurred in our study.
To create the class activation maps and the aggregated lead importance diagram (Figure 6), we used the LIME framework. One limitation of this approach is that the LIME module we used was originally intended for recurrent neural networks and not for convolutional neural networks, as used in this study. Secondly, previous research has stated that methods such as LIME are too generic and should be used with care on waveform data [40]. Comparing activation maps from LIME with those from model-specific explanation methods, such as Grad-CAM [16], would therefore be interesting.
4.3 Future perspectives
Future studies should consider other loss functions than binary cross-entropy when training neural network-based multi-label classification models on imbalanced datasets. Also, one should assess different ECG sampling frequencies to get optimal performance. In terms of model explainability, future studies should let medical doctors or cardiologists verify the ECG activation maps to assess the usefulness of the XAI.
5 Conclusion
The primary aim of this study was to train a multi-label ECG classifier to achieve the best possible performance, given the imbalanced dataset. Furthermore, we used this model to obtain class activation maps, and based on those we found the leads that were considered the most important for each diagnosis. We also found that double soft F1-loss might improve performance when classifying heavily imbalanced datasets. In addition, we observed that reducing the sampling frequency of the ECG from 500 Hz to around 75 Hz increased the model performance.
Data Availability
All data produced are available online at: https://github.com/Bsingstad/Post-George-Moody-challenge-2020-2021
Code availability
The complete source code described in this paper is openly available on GitHub (https://github.com/Bsingstad/Post-George-Moody-challenge-2020-2021) under a free software license (CC-BY 4.0).
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest concerning the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Norwegian Research Council (grant number: #309762 - ProCardio).
Acknowledgment
We would like to thank Antony M. Gitau for proofreading and giving constructive criticism of the manuscript.
Footnotes
b.j.singstad{at}fys.uio.no; morenzomuten{at}ieee.org