ABSTRACT
Background Transthoracic echocardiography (TTE) is the primary modality for diagnosing aortic valve stenosis (AVS), yet it requires skilled operators and can be resource-intensive.
Objectives To develop and validate an artificial intelligence (AI)-based system for evaluating AVS that is effective in both resource-limited and advanced settings.
Methods We created a dual-pathway AI system for AVS evaluation using a nationwide echocardiographic dataset (developmental dataset, n=8,427): 1) a deep learning (DL)-based AVS continuum assessment algorithm using limited 2D TTE videos, and 2) automated conventional AVS evaluation. We performed internal (internal test dataset [ITDS], n=841) and external validation (distinct hospital dataset [DHDS], n=1,696; temporally distinct dataset [TDDS], n=772) for diagnostic value across various stages of AVS and prognostic value for composite endpoints (cardiovascular death, heart failure, and aortic valve replacement).
Results The DL index for the AVS continuum (DLi-AVSc, range 0-100) increased with worsening AVS severity and demonstrated excellent discrimination for any AVS (AUC 0.87-0.99), significant AVS (0.93-0.97), and severe AVS (0.97). A 10-point increase in DLi-AVSc was associated with an 85% increased risk for composite endpoints in ITDS and a 53% and 59% increase in DHDS and TDDS, respectively. Automatic measurement of conventional AVS parameters demonstrated excellent correlation with manual measurement, resulting in high accuracy for AVS staging (98.2% for ITDS, 81.0% for DHDS, and 96.8% for TDDS) and comparable prognostic value to manually derived parameters.
Conclusions The AI-based system provides accurate and prognostically valuable AVS assessment, suitable for various clinical settings. Further validation studies are planned to confirm its effectiveness across diverse environments.
1. INTRODUCTION
Medical advancements have significantly increased life expectancy: about 10% of the global population is now over 60, a proportion projected to double by 2050.1 This aging demographic has notably increased the incidence of degenerative diseases like aortic valve stenosis (AVS). Studies have shown that 12.4% of individuals aged 75 and older have some degree of AVS, with severe cases at 3.4%.2 Untreated AVS can cause irreversible myocardial damage, characterized by left ventricular hypertrophy, fibrosis, and functional impairment, leading to increased morbidity, mortality, and socioeconomic burden.3 Therefore, timely detection and management of AVS are essential to mitigate its severe consequences.
Transthoracic echocardiography (TTE) is the primary imaging modality for assessing AVS. Accurate identification and staging of AVS via TTE require advanced expertise in scanning and interpretation, often unavailable in a general community healthcare setting. Even in tertiary care centers, the process is time-consuming and labor-intensive, involving multiple measurements, calculations, and precise interpretation. These complexities highlight the need for innovative solutions that simplify AVS assessment. Such solutions would be particularly beneficial in settings with limited resources by using fewer TTE videos and in more advanced settings by automating the measurement and interpretation processes.
To meet these clinical needs and advance beyond existing research,4–6 we developed a comprehensive artificial intelligence (AI)-based system to evaluate AVS, suitable for both resource-limited and advanced settings. This system uses deep learning (DL) to diagnose and assess AVS from limited 2-dimensional (2D) TTE videos. Importantly, it does not merely classify the AVS severity but is designed to reflect the disease’s progressive continuum. Simultaneously, the system automatically measures a broad spectrum of structural and hemodynamic parameters, facilitating the conventional calculation of the aortic valve area (AVA) and providing a quantitative assessment of AVS. This paper describes the development process of our AI-based system and evaluates its diagnostic and prognostic potential in assessing AVS.
2. METHODS
2.1. Study Population and Data Sources
The AI-based frameworks utilized in this study were developed and validated using the Open AI Dataset Project (AI-Hub) dataset, an initiative supported by the South Korean government’s Ministry of Science and ICT.7 This dataset consists of 30,000 echocardiographic examinations retrospectively collected from five tertiary hospitals between 2012 and 2021, covering a wide range of cardiovascular diseases. (Supplemental Methods 1) The AI-based frameworks introduced here were all developed using data extracted from the AI-Hub dataset.8–10 To develop the DL-based AVS continuum assessment algorithm, a key focus of this study, we assembled the Development Dataset (DDS) by deliberately excluding data from Severance Hospital, one of the five hospitals. Data from Severance Hospital were instead used exclusively for external validation (Distinct Hospital Dataset, DHDS). Further external validation was conducted using data collected from Seoul National University Bundang Hospital in 2022 (Temporally Distinct Dataset, TDDS). Detailed methodologies for data utilization in developing and validating the AI-based system are in Supplemental Methods 1. As a result, the DDS comprised TTE images from 8,427 patients, while the DHDS included 1,696 patients, and the TDDS included 772 patients. The study followed the Declaration of Helsinki (as revised in 2013). The institutional review board of each hospital approved this study and waived the requirement for informed consent because of the retrospective and observational nature of the study design. All clinical and echocardiographic data were fully anonymized before data analysis.
2.2. Echocardiogram Acquisition and Interpretation
All echocardiographic studies were conducted by trained echocardiographers or cardiologists and interpreted by board-certified cardiologists specialized in echocardiography. These reports adhered to the recent guidelines11,12 and were part of routine clinical care. The parameter values in these reports were used as ground truth labels without additional measurements. In the DDS, AVS presence and severity were determined using these values following the standard clinical criteria (Table 1).11 In the DHDS and TDDS, the clinician’s prior decision regarding AVS severity in the clinical report was used to reflect actual clinical practice.
2.3. AI-Based System
We developed a fully automated AI-based framework that addresses AVS evaluation through a dual pathway, combining a novel DL-based approach with conventional methodology. (Central Illustration) The system first automatically selects the necessary views: the parasternal long-axis (PLAX) view, the parasternal short-axis (PSAX) view at the aortic valve (AV) level, AV continuous wave (CW) and pulsed wave (PW) Doppler, and left ventricular outflow tract (LVOT) PW Doppler. In the DL-based AVS continuum assessment pathway, the algorithm evaluates AVS using only the PLAX and PSAX videos. Concurrently, in the automated conventional AVS assessment pathway, the DL segmentation network generates masks for each view. These masks are used to measure the LVOT diameter from the PLAX view and to analyze spectral Doppler images for key indicators such as AV peak velocity (Vmax), AV velocity time integral (VTI), AV mean pressure gradient (mPG), and LVOT VTI. The system then calculates AVA, enabling quantitative evaluation of AVS. This dual approach (DL-based AVS continuum assessment and automated conventional AVS assessment) has the potential to support both resource-limited and advanced settings.
2.3.1. View Classification
To assess AVS, we improved our preexisting view classification algorithm.8 The algorithm could already identify the PLAX view, PSAX at the AV level, AV CW Doppler from apical views, AV PW Doppler, and LVOT PW Doppler. We augmented it to recognize the PLAX-AV zoomed views and the AV CW Doppler obtained from the right parasternal view. Detailed information about this development is in Supplemental Methods 2.
2.3.2. DL-based AVS Continuum Assessment Algorithm
Our objective was to develop a network that classifies AVS severity in a way that reflects its continuous nature rather than just discrete categories. We used 3-dimensional (3D) convolutional neural networks (CNNs; r2plus1d18), which factorize convolutions into separate spatial and temporal filters, as the backbone.13 (Supplemental Methods 3) This network processes input videos from PLAX and PSAX at the AV level to output a score predicting the AVS severity, termed the DL index for the AVS continuum (DLi-AVSc). To achieve accurate classification reflecting the AVS continuum, we implemented two strategies: 1) continuous mapping with ordered labels and 2) multi-task learning with auxiliary tasks that predict numeric parameters indicative of the AVS continuum, such as AV Vmax, mPG, and AVA. Conventional multi-class classification with cross-entropy loss was unsuitable for reflecting the AVS continuum because one-hot encoded severity levels are equidistant from one another, so the loss cannot capture the disease’s progressive nature. Instead, the continuous approach assigns each severity level a value between 0 and 1 (e.g., Normal: 0, Sclerosis: 0.25, Mild: 0.5, Moderate: 0.75, Severe: 1) and trains the model by minimizing the negative Bernoulli likelihood, L_Bernoulli. While this method reflects AVS progression, it primarily converts discrete labels into continuous values. To truly capture the continuum and enable nuanced transitions within and between severity levels, we incorporated three auxiliary tasks predicting TTE parameters based solely on 2D TTE videos. These tasks, predicting Vmax, mPG, and AVA, provide rich information content, allowing the network to learn anatomical features and the motion of the AV. The loss function for each auxiliary task is the mean squared error (MSE) between the predicted and actual TTE parameter values. Training the network to predict continuous TTE parameters allows it to capture both discrete transitions and subtle variations within each severity category. For instance, it can distinguish between cases classified as “moderate” that are closer to mild AVS and those nearing severe AVS. The combined loss function integrates the negative Bernoulli likelihood and the MSE losses for the auxiliary tasks, with a weighting parameter λ balancing the contributions of the classification and regression tasks. Detailed network configurations and implementation details are in Supplemental Methods 3.
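The loss terms described above are not written out in this text. One plausible form, offered only as a sketch consistent with the description, is given below. The notation is an assumption introduced for illustration: y_i in [0, 1] is the ordered severity label, p̂_i is the network's sigmoid output, z_i^(k) and ẑ_i^(k) are the measured and predicted values of auxiliary parameter k, and N is the batch size. Applying a single λ to the sum of the three MSE terms is likewise an assumption.

```latex
\begin{align}
\mathcal{L}_{\mathrm{Bernoulli}} &= -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log \hat{p}_i + (1-y_i)\log\left(1-\hat{p}_i\right) \right] \\
\mathcal{L}_{\mathrm{MSE}}^{(k)} &= \frac{1}{N}\sum_{i=1}^{N}\left(\hat{z}_i^{(k)} - z_i^{(k)}\right)^{2}, \qquad k \in \{V_{\max},\ \mathrm{mPG},\ \mathrm{AVA}\} \\
\mathcal{L}_{\mathrm{total}} &= \mathcal{L}_{\mathrm{Bernoulli}} + \lambda \sum_{k} \mathcal{L}_{\mathrm{MSE}}^{(k)}
\end{align}
```

With a fractional target such as 0.25 or 0.75, the first term is a binary cross-entropy against a soft label and is minimized when the predicted probability equals the label, which is what lets ordered labels be placed on a single 0-to-1 axis.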
2.3.3. Automated Conventional AVS Assessment Algorithm
Our AI-based system also automates the conventional method to calculate AVA and assess AVS severity. Automating conventional AVA assessment in our system involves three key steps: 1) segmentation of anatomical structures and spectral Doppler envelopes, 2) uncertainty quantification to assess the confidence of the predicted segmentation masks, and 3) post-processing algorithms to extract clinical measurements from segmentation masks.
We had previously developed and validated algorithms for analyzing spectral Doppler by segmenting the Doppler envelope to capture velocity profiles with essential topological features.9,10 This approach automatically measures AV Vmax, AV VTI, and LVOT VTI by segmenting Doppler envelopes in every analyzable cycle in all provided images. In this study, to quantify AVA, we further developed a DL network based on the SegFormer transformer architecture to measure the LVOT diameter in the PLAX view.14 This advanced model can segment all anatomical structures visible in the PLAX view, including the left ventricle (LV), LV septum and posterior wall, left atrium, right ventricle, aorta, and even the mitral valve and AV. Detailed information is provided in Supplemental Methods 4 and Videos S1.
Deep segmentation networks are highly effective due to their ability to learn complex patterns and features from large datasets. However, quantifying uncertainty in their predictions is crucial because segmentation errors can impact subsequent post-processing for automatic measurement. To address this, we used predictive entropy from the segmentation network’s probability map, which combines two sources of uncertainty: lack of knowledge in DL (epistemic uncertainty) and poor data quality (aleatoric uncertainty).15 By evaluating the predictive entropy, cases requiring manual review due to poor image quality or model uncertainty can be identified. Detailed methodologies are provided in Supplemental Method 5 and Videos S2.
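As a rough illustration of the predictive entropy described above, the sketch below computes per-pixel entropy from a softmax probability map and flags an image for manual review when the mean entropy exceeds a threshold. The function names, the (classes, height, width) array layout, the use of the image-wide mean, and the threshold value are assumptions for illustration, not the configuration used in this study.

```python
import numpy as np


def predictive_entropy(prob_map, eps=1e-12):
    """Per-pixel entropy (nats) of a softmax map shaped (classes, height, width)."""
    # Clip to avoid log(0); clipped zeros contribute a negligible amount.
    p = np.clip(np.asarray(prob_map, dtype=float), eps, 1.0)
    return -np.sum(p * np.log(p), axis=0)


def needs_manual_review(prob_map, threshold=0.5):
    """Flag a mask when its mean entropy exceeds a hypothetical threshold."""
    return float(predictive_entropy(prob_map).mean()) > threshold


# Toy 2-class, 1x2 image: first pixel confident, second pixel maximally uncertain.
toy = np.array([[[1.0, 0.5]], [[0.0, 0.5]]])
entropy = predictive_entropy(toy)
flagged = needs_manual_review(toy)
```

In the toy example, the confident pixel has entropy near zero and the uncertain pixel has entropy ln 2, the maximum for two classes. In practice the threshold would need to be calibrated on held-out data.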
In the post-processing stage, the segmented masks were utilized to extract clinical measurements. From the predicted segmentation mask, we identified points where the mitral valve intersects with the aorta and where the septum intersects with the aorta to determine annulus points. Considering the differing opinions on the appropriate location for measuring the LVOT diameter,16 our algorithm was designed to measure the LVOT diameter at three different locations: at the annulus, and at 2.5 mm and 5 mm away from the annulus towards the LV cavity. In this study, the measurements taken at the annulus were used for analysis as they showed the highest agreement with the ground truth. For technical details and performance information, please refer to Supplemental Methods 6 and Video S1.
For spectral Doppler images, AV Vmax and VTI were derived from the segmented Doppler envelope of AV CW Doppler. This analysis included AV CW Doppler obtained from both the apical and right parasternal views, selecting the largest envelope across all cycles in all images to obtain AV Vmax and VTI. The LVOT PW Doppler analysis also spanned all cycles, using the average value of LVOT VTI to avoid overestimating LVOT flow.12 These measurements were then used to calculate mPG and AVA, which were used to assess the presence and severity of AVS.11
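The calculation described above can be sketched as follows, assuming the continuity equation for AVA and the simplified Bernoulli relation (instantaneous gradient = 4v²) averaged over the traced envelope for mPG, which are the standard echocardiographic formulas. The function names, units, and sample values are illustrative assumptions rather than the study's implementation. Taking the largest per-cycle AV VTI is a stand-in for selecting the largest envelope.

```python
import math


def aortic_valve_area(lvot_diameter_cm, lvot_vti_cm, av_vti_cm):
    """Continuity equation: AVA (cm^2) = LVOT area x LVOT VTI / AV VTI."""
    lvot_area_cm2 = math.pi * (lvot_diameter_cm / 2.0) ** 2
    return lvot_area_cm2 * lvot_vti_cm / av_vti_cm


def mean_pressure_gradient(envelope_velocities_m_s):
    """Mean of instantaneous gradients 4*v^2 (mmHg) over the traced envelope."""
    gradients = [4.0 * v ** 2 for v in envelope_velocities_m_s]
    return sum(gradients) / len(gradients)


def summarize_cycles(av_vti_per_cycle_cm, lvot_vti_per_cycle_cm):
    """Largest AV VTI across cycles; average LVOT VTI across cycles."""
    av_vti = max(av_vti_per_cycle_cm)
    lvot_vti = sum(lvot_vti_per_cycle_cm) / len(lvot_vti_per_cycle_cm)
    return av_vti, lvot_vti


# Hypothetical measurements for one examination.
av_vti, lvot_vti = summarize_cycles([92.0, 100.0, 96.0], [19.0, 21.0])
ava = aortic_valve_area(lvot_diameter_cm=2.0, lvot_vti_cm=lvot_vti, av_vti_cm=av_vti)
```

Because the LVOT diameter enters the formula squared, a small error in the diameter produces a proportionally larger error in AVA, which is one reason the measurement location matters.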
2.4. Ascertainment of Clinical Information and Outcome Definition
The clinical data were acquired by a dedicated review of the electronic health records at the study institutions. The clinical outcome was defined as a composite endpoint of cardiovascular death, hospitalization for heart failure, and AV replacement via surgical or transcatheter approaches.
2.5. Validation of AI-Based AVS Evaluation System and Statistical Analysis
Our AI-based framework was validated using an internal test dataset (ITDS) and two external datasets (DHDS and TDDS). The view classification algorithm, the shared initial step, was evaluated against human expert labels. Precision, recall, and F1 scores were calculated for each view, with overall accuracy determined by the ratio of correctly classified images to the total number of images.
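The view-level metrics named above can be computed as in the following minimal sketch. The function name and the sample labels are hypothetical; only the definitions of precision, recall, F1 score, and overall accuracy are taken from the text.

```python
from collections import Counter


def per_view_metrics(true_views, predicted_views):
    """Return {view: (precision, recall, f1)} and overall accuracy."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_views, predicted_views):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted view gains a false positive
            fn[truth] += 1  # true view gains a false negative
    metrics = {}
    for view in sorted(set(true_views) | set(predicted_views)):
        predicted_total = tp[view] + fp[view]
        actual_total = tp[view] + fn[view]
        precision = tp[view] / predicted_total if predicted_total else 0.0
        recall = tp[view] / actual_total if actual_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[view] = (precision, recall, f1)
    # Overall accuracy: correctly classified images / total images.
    accuracy = sum(tp.values()) / len(true_views)
    return metrics, accuracy


# Hypothetical expert labels and predictions for four images.
truth = ["PLAX", "PLAX", "PSAX", "AV CW"]
pred = ["PLAX", "PSAX", "PSAX", "AV CW"]
metrics, accuracy = per_view_metrics(truth, pred)
```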
Subsequently, we evaluated the two AI-based pathways. The performance of the DL-based AVS continuum assessment algorithm was evaluated by examining the distribution of the DLi-AVSc across various stages using violin plots. We also assessed the correlation of DLi-AVSc with conventional parameters (AV Vmax, mPG, and AVA). To verify that DLi-AVSc accurately reflects the continuum of AVS progression, we used Uniform Manifold Approximation and Projection (UMAP) to visualize this relationship,17 projecting the data into a 2D space, using 15 nearest neighbors, a minimum distance of 0.1, and the Euclidean distance. To highlight the areas with the greatest influence on the model’s prediction, we generated saliency maps using the Gradient-weighted Class Activation Mapping (Grad-CAM).18 We present representative samples for each severity level in both PLAX and PSAX views.
The conventional AVS assessment algorithm was validated by comparing AI-derived parameters with manual measurements. Since these parameters are not typically measured in normal or AV sclerosis groups, the comparison was limited to the AVS group. Moreover, as manual measurements were not always available for all AVS cases, details on ground truth measurements availability and the success rate of automatic measurements are provided in Supplemental Methods 7. The correlation between automated and manual measurements was assessed using the Pearson Correlation Coefficient (PCC). The AVS severity determined from the automatic measurements was also compared to the ground truth label made by the clinician’s prior decision.
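A minimal sketch of the agreement analysis is shown below, assuming that cases lacking either the manual or the automated value are dropped before the PCC is computed. The function name, the NaN convention for unavailable measurements, and the sample values are assumptions for illustration.

```python
import numpy as np


def pearson_correlation(manual, automated):
    """Pearson correlation over cases where both measurements are available."""
    manual = np.asarray(manual, dtype=float)
    automated = np.asarray(automated, dtype=float)
    paired = ~(np.isnan(manual) | np.isnan(automated))
    return float(np.corrcoef(manual[paired], automated[paired])[0, 1])


# Hypothetical AV Vmax values (m/s); NaN marks an unavailable measurement.
manual_vmax = [2.1, 3.0, np.nan, 4.2, 5.0]
auto_vmax = [2.0, 3.1, 3.5, 4.1, np.nan]
pcc = pearson_correlation(manual_vmax, auto_vmax)
```

Only three of the five hypothetical cases contribute here, which illustrates how missing measurements reduce the number of comparison cases.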
We also evaluated the discrimination ability of the DLi-AVSc and other AI-derived conventional parameters for various stages of AVS, including mild or greater AVS (any AVS), moderate or greater AVS (significant AVS), and severe AVS. This evaluation was conducted through receiver operating characteristic (ROC) curve analysis, from which we calculated the area under the curve (AUC).
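The three discrimination tasks can be illustrated with the sketch below, which binarizes the ordered stages at a chosen threshold and computes the AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case (equivalent to the Mann-Whitney statistic). The stage names, function names, and sample scores are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
SEVERITY_ORDER = ["normal", "sclerosis", "mild", "moderate", "severe"]


def binarize(stages, threshold_stage):
    """Label 1 if the stage is at or above threshold_stage, else 0."""
    cutoff = SEVERITY_ORDER.index(threshold_stage)
    return [int(SEVERITY_ORDER.index(s) >= cutoff) for s in stages]


def roc_auc(labels, scores):
    """AUC as P(positive score > negative score); ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical stages and index values on a 0-100 scale.
stages = ["normal", "sclerosis", "mild", "moderate", "severe", "mild"]
scores = [5, 30, 45, 70, 95, 25]
auc_any = roc_auc(binarize(stages, "mild"), scores)              # any AVS
auc_significant = roc_auc(binarize(stages, "moderate"), scores)  # significant AVS
```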
Lastly, we assessed the prognostic capability of AI-derived parameters for composite endpoints. Specifically, we conducted a spline curve analysis for our novel index, the DLi-AVSc, to visualize its predictive power. Additionally, we applied Cox regression analysis to validate the prognostic relevance of the DLi-AVSc and other AI-derived AVS parameters, with adjustment for clinical risk factors (age, sex, body mass index, hypertension, and diabetes).
3. RESULTS
3.1. Baseline Characteristics
The distribution of AVS severity across the three datasets is shown in Table 1: ITDS (n=841), DHDS (n=1,696), and TDDS (n=772). ITDS and TDDS exhibited a higher prevalence of mild AVS (28% and 41%, respectively), with fewer moderate and severe cases. Conversely, DHDS displayed a more balanced severity distribution (12% mild, 15% moderate, and 12% severe). Baseline clinical characteristics are available in Supplemental Result 1.
3.2. View Classification
Our view classification algorithm accurately identified the required images for assessing AVS across all datasets. The overall accuracy rates were 99.6% for ITDS, 99.5% for DHDS, and 99.4% for TDDS. Detailed metrics are in Supplemental Result 2.
3.3. Performance of DL-Based AVS Continuum Assessment Algorithm
The distribution of the DLi-AVSc, produced by the DL-based AVS continuum assessment algorithm, exhibited a consistent trend of increasing scores with the severity of AVS across all datasets. (Figure 1A) Interestingly, at the AV sclerosis stage, the DLi-AVSc was already significantly higher than at the normal stage, indicating the algorithm’s ability to detect early changes. When discordant cases excluded from the training dataset were included in the ITDS, mild to moderate and low-flow, low-pressure gradient moderate AVS were distributed between mild and moderate AVS, while moderate to severe and low-flow, low-pressure gradient severe AVS were distributed between moderate and severe AVS. (Supplemental Results 3) The DLi-AVSc demonstrated an increasing trend as conventional parameters assessing AVS severity, such as AV Vmax, mPG, and AVA, worsened. (Supplemental Results 4)
Furthermore, we used UMAP to verify that the DLi-AVSc accurately represents the AVS continuum. The DLi-AVSc derived from the approach incorporating both ordered labels and multi-task learning displayed a distinct continuous gradient from normal through AV sclerosis to advancing AVS stages, consistently evident in the ITDS and both external datasets. (Figure 1B) In contrast, a conventional multi-class classification approach using 5-class cross-entropy loss produced stage-based grouping but lacked the continuous progression seen in our approach. Continuous mapping with ordered labels alone, without the additional multi-task learning to predict key TTE parameters, produced a somewhat linear arrangement but did not accurately reflect the severity progression. (Supplemental Results 5)
For each severity level, we present representative samples with Grad-CAM saliency maps overlaid on both PLAX and PSAX views, specifically localizing the AV. (Supplemental Results 6 and Video S3) These results demonstrate that our model accurately identifies the relevant regions for evaluating AVS across all severity levels and views without supervision.
3.4. Performance of Automated Conventional Assessment Algorithm
Our algorithm’s automatic measurements demonstrated high correlations with the ground truth values for AV Vmax (PCC 0.974-0.991) and mPG (PCC 0.966-0.991). (Figure 2A) The correlation for AVA (PCC 0.789-0.887) was also good but relatively lower than Vmax and mPG, as AVA is calculated from multiple measurements. Missing measurements resulted in fewer comparison cases (Supplemental Methods 7), and accumulated differences affected the overall accuracy. The overall accuracy of AVS severity classification among any AVS based on these automated measurements was 98.2% for ITDS, 81.0% for DHDS, and 96.8% for TDDS. (Figure 2B)
3.5. Comparison of Diagnostic Performance of the Two AI-Based Approaches
The discrimination performance of DLi-AVSc for various stages of AVS was generally excellent: AUC 0.87-0.99 for any AVS, 0.93-0.97 for significant AVS, and 0.97 for severe AVS. (Figure 3) When compared to automatically measured conventional parameters, in ITDS, the discrimination performance of DLi-AVSc was lower than that of automatically measured Vmax and mPG but comparable to AVA. In DHDS, the performance of DLi-AVSc surpassed AVA in diagnosing all stages of AVS, while in TDDS, it was comparable to AVA for diagnosing significant and severe AVS.
3.6. Prognostic Value of AI-Based AVS Assessment
Analysis of spline curves across the ITDS, DHDS, and TDDS showed that an increase in DLi-AVSc correlated with a rising risk of adverse clinical outcomes. (Figure 4) The multivariable Cox regression analysis affirmed the strong and independent prognostic value of DLi-AVSc. A 10-point increase in DLi-AVSc from limited TTE videos was associated with an 85% increase in adverse outcome risk in ITDS and a 53% and 59% increase in DHDS and TDDS, respectively. (Figure 5) Moreover, the AI-derived parameters, such as Vmax, mPG, and AVA, demonstrated prognostic values comparable to those of manually derived parameters. (Figure 5)
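For readers converting between scales, the per-10-point estimates can be related to a per-point coefficient under the usual Cox model assumption that the log hazard is linear in the covariate. The symbol β (log hazard ratio per 1-point increase in DLi-AVSc) is introduced here for illustration, and the worked numbers use the ITDS estimate purely as arithmetic, not as additional results.

```latex
\begin{align}
\mathrm{HR}_{\Delta} &= \exp(\beta\,\Delta), \qquad \mathrm{HR}_{10} = \exp(10\,\beta) \\
\mathrm{HR}_{10} = 1.85 \;&\Rightarrow\; \beta = \frac{\ln 1.85}{10} \approx 0.0615, \qquad \mathrm{HR}_{1} = e^{\beta} \approx 1.063 \\
\mathrm{HR}_{20} &= \mathrm{HR}_{10}^{2} = 1.85^{2} \approx 3.42
\end{align}
```

The extrapolation to 20 points holds only if the log-linear assumption is reasonable over that range, which the spline analysis is intended to examine.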
4. DISCUSSION
We have developed and validated a comprehensive AI-based system to evaluate AVS, suitable for both resource-limited and advanced settings. It addresses AVS through a dual pathway: 1) it can evaluate the presence and severity of AVS using only the PLAX and PSAX videos initially acquired during TTE, and 2) if additional images are obtained in advanced settings, it can automatically analyze these to diagnose and assess AVS using conventional methods. Internal and external validation demonstrated excellent diagnostic accuracy and strong prognostic capabilities.
While our AI-based system is not the first to evaluate AVS, it stands apart from previous studies by enabling both automation of conventional measurements and evaluation using limited 2D TTE videos. Prior research has typically focused on one of these aspects. For instance, Krishna et al. developed an AI model to automate quantitative AVS evaluation.6 However, their model did not include the crucial initial visual analysis of the AV from 2D TTE videos, which is essential for initiating conventional quantitative AVS analysis. Several studies used CNNs to extract AVS-related features from 2D TTE videos through end-to-end learning without requiring Doppler information.5,7,19,20 Although these studies achieved good performance in classifying AVS severity, they lack conventional evaluation of AVS, compromising trustworthiness, explainability, and interpretability. Our system is the first to integrate both approaches, making it suitable for both resource-limited and advanced settings and even as a hybrid solution. Since PLAX and PSAX views are typically acquired at the initial stage of TTE, our system can use these views to derive the DLi-AVSc, which can indicate a high probability of significant AVS and prompt the acquisition of additional views for conventional AVS evaluation. This approach can guide less experienced operators, reducing image acquisition and interpretation errors. For example, if AV CW Doppler is not properly acquired, it could lead to AVS underestimation or misinterpretation of low-flow, low-pressure gradient AVS. In that case, a high DLi-AVSc can suggest the likelihood of significant AVS, thereby guiding further necessary evaluations.
Another strength of our study is that, unlike previous research, it reflects the continuous nature of AVS progression. For instance, Wessler et al. trained CNNs to classify AVS severity into three categories (no, early, and significant AVS) using limited 2D images.7 Similarly, Ahmadi et al. proposed a transformer-based spatiotemporal architecture to classify AVS into four categories (normal, mild, moderate, and severe AVS) by capturing anatomical features and AV motion.19 Vaseli et al. focused on model explainability in AVS severity classification, incorporating uncertainty estimation and classifying AVS severity into three classes (no, early, and significant AVS).20 However, these classifiers discretize AVS severity, losing the continuum information of AVS. Recently, Holste et al. proposed a binary classifier based on the 3D-ResNet18 architecture to detect severe AVS, observing that the model’s output probabilities increase with AVS severity.5 However, this model was trained only on a binary classification task (non-severe vs. severe) and therefore did not capture the full range of AVS severity levels during training. In contrast, our framework employs continuous mapping with ordered labels, providing a more nuanced representation of AVS severity. Importantly, we use multi-task learning with auxiliary tasks to predict continuous AVS TTE parameters. This approach not only transitions from discrete labels to continuous values but also captures the underlying continuum of the disease more effectively. In UMAP visualizations, our model demonstrates a clear continuous gradient from normal to severe AVS, unlike other classification models. Additionally, the appropriate distribution of DLi-AVSc in discordant cases further supports the performance of our framework. It should be noted that our dataset was collected entirely from tertiary hospitals. Therefore, it is significant that our model can diagnose and predict AVS outcomes at a level comparable to parameters derived in advanced settings.
The implications of our AI-based system extend beyond precise AVS diagnosis. Our DLi-AVSc exhibits significant prognostic capability, comparable to traditional AVS parameters, even when utilizing only PLAX and PSAX views. Moreover, the DLi-AVSc increases notably from normal levels at AV sclerosis and mild AVS stages before significant AVS progression. To our knowledge, this is the first algorithm to achieve such performance. DLi-AVSc is poised to effectively monitor AVS progression from preclinical stages as a score-based tool. We anticipate the clinical utility of our system becoming prominent, especially as new pharmacological treatments are investigated for AVS prevention.21,22 If such treatments become available, our algorithm’s sensitivity in detecting early AVS stages will be highly advantageous.
Limitations
The present study has some limitations. Although we developed and thoroughly validated our AI-based system using data from multiple centers, including internal and external validation, all the data were obtained from tertiary centers in South Korea. This means that skilled operators acquired TTE, and it remains to be seen if the DLi-AVSc will perform well on TTE videos acquired in truly resource-limited and novice settings. Further evaluation is needed to confirm its performance in various clinical environments and among different populations. We plan to conduct additional validation in primary clinics and a multi-national study to address these concerns. Additionally, while we designed the DLi-AVSc to reflect the AVS continuum, it needs to be verified whether the DLi-AVSc increases progressively with the natural progression of AVS. This issue will be addressed in future studies.
Conclusions
We developed and validated a comprehensive AI-based system for evaluating AVS. This system operates through a dual pathway: it assesses the presence and severity of AVS using limited TTE videos and simultaneously automates conventional quantitative AVS evaluation. Internal and external validations demonstrated excellent diagnostic accuracy and strong prognostic capabilities. While additional validation in various clinical settings is needed, our system is expected to be suitable for both resource-limited and advanced settings.
CLINICAL PERSPECTIVES
Competency in Medical Knowledge
Echocardiography is a primary imaging tool for evaluating aortic valve stenosis (AVS), necessitating advanced expertise. This study demonstrates the feasibility and high accuracy of an AI-enhanced system in diagnosing and assessing the severity of AVS. The AI system provides a severity index derived from limited echocardiographic images and automatically measures conventional AVS parameters, showing high agreement with expert assessments and potential value in predicting outcomes.
Translational Outlook
The current AI system can accurately identify AVS and assist in the precise clinical evaluation of AVS. The clinical benefit of this AI system in managing AVS patients, particularly regarding long-term improvements in clinical outcomes, needs to be validated in further prospective clinical trials.
Data Availability
The AI-based frameworks utilized in this study were developed and validated using the Open AI Dataset Project (AI-Hub) dataset, an initiative supported by the South Korean government's Ministry of Science and ICT.