ABSTRACT
Objectives Firstly, we aimed to develop a system capable of detecting multiple cardiac abnormalities simultaneously from 12-lead ECG recordings. Secondly, we tried to improve the detection by analyzing the relationship between imbalanced datasets and optimal classification thresholds.
Methods A novel fusion of Convolution Positional Encoder and Transformer Encoder was used to solve the multi-label classification problem. We used a proper evaluation metric called area under the precision-recall curve (AUPRC) that enabled us to analyze the precision-recall trade-off and find the optimal thresholds.
Results Having outperformed other popular deep networks, the model achieved the highest AUPRC of 0.96 and f1-score of 0.90 on 42511-sample datasets. We also found the negative correlation coefficient of -0.68 be-tween optimal thresholds and the proportion of positive samples.
Significance &Conclusion This study compared the performances of different deep learning architectures on a medical problem and showed the potential of advanced techniques in capturing spatial, temporal features alongside attention mechanisms. It also introduced how to reduce the impact of imbalanced datasets and find optimal classification thresholds.
1. INTRODUCTION
Cardiovascular diseases have been increasingly common in modern society nowadays, thus it is imperative to develop techniques to deal with those. One of the most effective tools to analyze and diagnose such diseases is the electrocardiogram (ECG), and making automatic diagnoses from ECG recordings has become vitally important. Traditionally, there were several techniques that used machine learning to automatically diagnose cardiovascular diseases. For example, [1–3] employed Support Vector Machine (SVM) [4] and Decision Tree [5]. However, conventional machine learning algorithms require a manual feature engineering process. It is often time-consuming and a number of potential features in the ECG signals may be neglected. To deal with this problem, modern deep learning methods have been introduced to automatically extract a far larger number of latent features in the data. For example, [6–9] used Convolutional Neural Network (CNN) [10] and Long Short Term Memory (LSTM) [11] to capture the spatial and temporal information from the signals. However, these studies assumed that the abnormalities were mutually exclusive, i.e., one recording had no more than one abnormality, whereas in practice, patients can suffer from multiple cardiac abnormalities simultaneously. In addition, a wide range of research works analyzed the performance of one deep neural network on different datasets, thus there was a lack of comparisons of multiple neural network models on the same dataset, which could give us insights into how they would perform on medical data. Besides, medical datasets are often imbalanced, i.e., the proportion of negative samples are far higher than that of positive ones, thus it is vital to use proper methods for assessment. Previous studies such as [6–8,12,13] used f1-score [14] to evaluate their models. Though f1-score can fairly take positive and negatives samples into consideration, it does not consider the whole threshold range [0, 1]. Instead, it uses the default value of 0.5: if the output probability is lower than that value, then its original signal will be classified as negative, and otherwise. There may exist an optimal threshold value that balances the precision-recall trade-off and maximizes f1-score, along-side the correlation between that optimal value and the proportion of positive samples.
Major contributions of this work are as follows:
We developed a model that could detect multiple cardiac issues simultaneously, rather than considering mutually exclusive abnormalities.
We trained and tested a fusion of recent advancements in Deep Learning: Residual Convolution for exploring spatial features; Positional Encoder which is able to capture temporal information in parallel, and Transformer Encoder to utilize attention mechanisms. The model outperformed other well-known architectures in detecting all abnormalities.
We evaluated our model in the whole classification threshold range [0, 1], and found optimal threshold value for each cardiac abnormality. We also showed the negative correlation between the optimal threshold and the proportion of positive samples.
2. EXPLORATORY DATA ANALYSIS
We used three public datasets for the PhysioNet Challenge 2020 [15, 16], which were collected from multiple sources in three different continents. The first source was the dataset from the China Physiological Signal Challenge 2018 (CPSC2018) [17], Southeast University, China. It has a total of 10,330 recordings (5,542 males, 4,788 females), sampled at 500 Hz, with varying durations from 10 to 60 seconds. The second one was from the Physikalisch Technische Bundesanstalt (PBT-XL) [18, 19], Brunswick, Germany, with a total of 21,837 recordings (11,379 males and 10,458 females), sampled at 500 Hz for 10 seconds. The last data source was the Georgia 12-Lead ECG Challenge Database [20], collected at Emory University, Atlanta, Georgia, USA. This dataset consists of 5,551 male and 4,793 female recordings, yielding a total of 10,344. They were also sampled for 10 seconds at 500 Hz. In total, there were 42,511 recordings in our datasets.
To be compatible with deep neural networks, all data samples must have the same length. Since the CPSC2018 dataset has variable-length recordings, we either padded or truncated some recordings, to increase or reduce their length to a target one. In the three datasets, 35,948 out of 42,511 recordings last 10 seconds, amounting to more than 84%. Therefore, we chose 10-second as the target duration, corresponding to the sequence length of 10 × 500 = 5000. All recordings that were longer than 10 seconds were truncated evenly on both sides. For example, if a recording lasted 60 seconds from 0 to 59, we kept the middle 10-second sub-recording from 25 to 34 and eliminated the rest.
We trained our model to detect 13 cardiac types: a normal class (Normal Sinus Rhythm - NSR) and 12 ab-normalities, namely 1st Degree AV Block (IAVB), Abnormal QRS (abQRS), Atrial Fibrillation (AF), Left Anterior Fascicular Block (LAnFB), Left Axis Deviation (LAD), Left Bundle Branch Block (LBBB), Left Ventricular Hypertrophy (LVH), Myocardial Infarction (MI), Myocardial Ischemia (MIs), Right Bundle Branch Block (RBBB), Sinus Bradycardia (SB) and Sinus Tachycardia (STach). Table 1 shows the dataset statistics.
3. METHODS
This section gives details about the proposed CNN-TE model, whose block diagram is shown in Figure 1. We describe the role of each module in the diagram as follows.
3.1 Convolution
Convolutional Neural Networks have been widely used in Computer Vision and even Natural Language Processing applications, such as object detection, object recognition [21, 22], sentiment analysis [23]. The key components of such networks are convolutional layers, which capture spatial information and expand the feature representation of the input tensors. In other words, it reduces the spatial dimension and increases the feature dimension. Thus these layers should perform well on tasks related to processing images (2D array) or time-series signals (1D array).
In this study, we built a convolutional module which mainly consisted of 8 residual blocks [24], as illustrated in Figure 2. Given 12-lead ECG recordings as its input in the format of a tensor with shape (batch size, sequence length, num features) = (256, 5000, 12), this block reduced the sequence length (spatial dimension) from 5000 to 157, and increased the feature dimension from 12 to 512, to output a tensor with shape (256, 157, 512).
3.2. Positional Encoder
ECG recordings can be considered to be time-series data, thus they might contain latent temporal information. It would be beneficial if such information could be captured. Recurrent Neural Networks (RNN) such as Gated Recurrent Units (GRU) [25] or Long Short Term Memory (LSTM) [11] have been widely used for that purpose. However, those recurrent models learn temporal representation from the data sequentially, i.e, they must have the result at timestep t −1 available to compute the result at timestep t, thus require a long training time. In this study, the Positional Encoder [26] was employed to inject the temporal information of the input recordings into themselves. This module is capable of capturing the information in parallel.
Let X ∈ Rl×d denotes an input recording of the block, where l is the length of the recording, d is the number of features at each position. The Positional Encoder encodes the position P ∈ Rl×d of X and outputs P + X. The position P is computed by sine and cosine functions as follows:
Here, i = 0, …, l − 1 represents the position of each element in the recording, j = 0, …, ⌊(d−1)/2⌋ refers to the feature dimension. Specifically, after the convolutional module, the recording length was l = 157 and the number of features was d = 512. Since Pi,2j and Pi,2j+1 are independent of other components at i −1, i −2, …, they can be computed in parallel, resulting in faster training time. The input and output of the positional encoder were added, then passed to the Transformer Encoder.
3.3 The Transformer Encoder
3.3.1 Multi-head Attention
Attention mechanism [27, 28], based on the idea in cognitive science that we should focus on some key parts when processing a huge amount of information, is one of the most powerful techniques in Deep Learning nowadays. Mathematically, suppose that an attention layer has a “memory”, in which past information has been stored in the form of n key-value pairs: (k1, v1), …, (kn, vn), with . Given an input , called a query, the attention layer returns an output that has some parts being focused more than others, based on the following procedure.
Firstly, the layer computes the similarity s = (s1, …, sn) of the input query q with each key k1, …, kn by a scoring function S:
The scoring functions S are often either dot-product or multi-layer perceptron [29]. The Transformer architecture uses scaled dot-product attention (DPA) [26], which is defined by:
Here the query q and the key ki have the same dimension dq = dk = d, ⟨q, ki⟩ is the dot-product of the two d-dimensional vectors q and ki. The dot-product is then scaled by a factor of to prevent vanishing gradients when d is large [26].
Then, the similarities are passed through a softmax function to produce attention weights w = [w1, …, wn]T : or
Finally, the attention layer’s output o is determined by the weighted sum of the values:
In short, the DPA can be described as: where Q ∈ Rn×d, K ∈ Rn×d, are arrays of query, key and value vectors.
The Transformer used self-attention mechanisms, in which the query, key, and value are identically taken from each element of the input sequence. Figure 3 shows the architecture of a multi-head attention layer. Instead of processing the entire d-dimensional queries, keys, and values, the Transformer Encoder split the feature dimen-sion equally into h DPA blocks called heads, projects the queries, keys, and values with 3 Linear layers, performs the attention function parallelly, then concatenate the results from each head, and eventually passes the concatenated result to a final Linear layer. This mechanism enables the model to focus on different feature subspaces at different positions. Moreover, the output can be computed in parallel, thus the multi-head attention layer should be more efficient than sequential-computing recurrent neural networks.
In this study, the input queries, keys, and values of the multi-head attention block were duplicates of the convolution’s output after being positional encoded. Then, the three’s feature dimension would be equally split for h = 2 parallel attention heads, each had the same feature dimension dq = dk = dv = d/ = 512/2 = 256. The results of each head were concatenated, thus the output still had the shape of (256, 157, 512).
3.3.2 Add & Normalization
The Add&Norm blocks also play a key role in the model. It uses residual connection [24] to add the inputs after dropout [30] of the Multi-head Attention block and the Position-wise Feed Forward Network to their respective outputs. Then the sums are normalized over the feature dimension by Layer Normalization [31], which standardizes the features in one training example, to facilitate faster training and better generalization.
3.3.3 Position-wise Feed Forward Network
This block consists of two fully connected linear transformations with a ReLU activation function [32] in between. It accepts 3-dimensional input tensors of shape (batch size, sequence length, num features), and only applies to the feature dimension. In other words, the block performs on each position in the recording length independently (position-wise). It outputs a tensor with the same shape as its input.
3.4 The output layer
After the Transformer Encoder, the model used a Dense layer with 13 hidden units with a softmax activation to generate the positive probabilities of 13 classes.
4. EXPERIMENTAL SETTINGS
In our experiments, we split the dataset into two subsets for training and testing, with a ratio of 9:1. With regard to hardware, we trained the models on an NVIDIA Tesla P100 GPU, repeatedly during 50 iterations (epochs). In each iteration, we split the training set into batches of 256 recordings each. To mitigate the undesired effect from the very high imbalance dataset, we employed oversampling [33] for the minority positive class of each abnormality. This served to increase the ratio of positives and negatives in each batch. In addition, the model parameters were initialized by a Xavier initializer [34], and optimized by Adam algorithm [35]. The learning rate of the Adam optimizer was set to 0.001 and was scheduled during training by cosine annealing warm restart [36], to quickly start the optimization process, then gradually decrease the learning rate to find the optimal solution. Finally, we used several regularization techniques to deal with overfitting, such as batch normalization [37], dropout [30], weight decay [38].
5. EVALUATION METRIC
The output layer of the model generates a positive probability for each abnormality. Then by defining a threshold, the probability can be used to classify its input into either positive or negative:
There are two types of error that a model can make in a classification problem: false positive and false negative. Precision and recall can be used to evaluate the two errors:
The formulae show that precision covers the falsepositive error, while recall takes false-negative cases into consideration. However, the two metrics cannot cover the other error type. In addition, the dataset was highly imbalanced, i.e., the ratio of positive and negative samples was very low for every abnormality, thus the model can be biased toward the majority negative class, having good precision and poor recall. Therefore, it is important to use a metric that is more balanced. One metric that can be used is f1-score, which is the harmonic mean of precision and recall, thus can better assess model performances.
However, when using the precision, recall, and f1score, a classification threshold, 0.5 by default, must be specified in advance. It is better to evaluate models across the whole threshold range [0, 1] and enables us to find the optimal threshold value for each class. The area under receiver operating characteristics (AUROC) [39] and area under precision-recall curve (AUPRC) [40] are two such metrics. Nonetheless, [41] showed that the AUROC was likely to give misleading results when used on imbalanced datasets. Therefore, in this study, the AUPRC was used to evaluate our model. The threshold range [0, 1] was split into a large number of discrete values. For each value, we calculated the corresponding precision, recall, and f1-score, then plot a point that represents the precision at the y-axis and recall at the x-axis. Repeat the process for all threshold values results in a smooth curve, and the area under the curve is the AUPRC of the model, which also ranges from 0 to 1. The optimal threshold can be obtained by sorting the f1-score array.
6. RESULTS & DISCUSSIONS
In this work, we compared the performance of our proposed CNN-TE model with five different deep neural networks:
VGG11 [42]: this model consisted of convolutional and fully connected layers, only capable of extracting spatial features.
GRU [25] and LSTM [11]: the two popular RNNs for time-series tasks, only capable of extracting temporal information.
CNN-GRU and CNN-LSTM: these stacked versions of convolutional and recurrent layers are able to extract both spatial and temporal features.
Table 2 shows the AUPRC of those models on 13 cardiac types in the test set. The CNN-TE outper-formed other models in detecting all abnormalities, with AUPRC surpassed 0.8 on NSR, RBBB, STach, AF, MI, and LBBB. Visually, Figure 4 illustrates the precision- recall curves of all six models for those abnormalities. The curve of the CNN-TE was represented in blue, and it is clear that the area under the blue curve was larger than the areas under other curves. Meanwhile, CNN-GRU and CNN-LSTM had approximately the same performance. When using separately, LSTM was better than GRU, and the VGG11 performed poorly in most classes. Since the VGG11 was outperformed by GRU and LSTM, it could be inferred that temporal features in the ECG signals were much more important than spatial ones for classification. Additionally, stacking convolutional layers at the front significantly improved the RNNs’ performance, as shown in CNN-GRU and CNN-LSTM. Meanwhile, the CNN-TE model was not only capable of exploiting both types of features but also had intrinsic attention mechanisms, thus achieved the best results.
Our evaluation method allowed the optimal threshold that maximized the f1-score of a class to be obtained. Table 3 shows optimal classification threshold for each cardiac types, with the corresponding precision, recall and f1-score. All optimal thresholds were larger than the naive value of 0.5. which was expected for datasets that are heavily imbalanced towards the negative class. The Pearson product-moment correlation coefficient of -0.68 between the optimal thresholds and the percentage of positive samples indicated their strong negative correlation. In other words, the optimal threshold for a class tended to increase when there were fewer positive samples of that class.
7. CONCLUSION
In this study, we tried to tackle the multi-label classification problem in a medical context, detecting 13 cardiac types simultaneously from 12-lead ECG recordings. Specifically, we developed the combination of Convolution, Positional Encoder to explore spatial and temporal features in the signals, alongside a Transformer Encoder to utilize its multi-head attention mechanisms. We also took a proper evaluation method into consideration, by using a more comprehensive metric called AUPRC to assess model performance on the whole classification threshold range [0, 1]. Based on this method, we drew a comparison of how different deep learning models performed, obtained optimal threshold values for all cardiac types, then related it to the proportion of positive class. The results showed the effectiveness of our CNN-TE models that outperformed other networks. In addition, the strong negative correlation between the optimal threshold and the positive ratio indicated the significance of not relying on the naive threshold of 0.5 for medical classification tasks, which mostly involve heavily imbalanced datasets.
Data Availability
All data produced are available online at the Physionet website.