Abstract
Objective To develop and evaluate an automated methodology for dimensionality reduction of the FDA’s MAUDE database through schema matching and merging.
Methods We conducted 96 trials that combined clustering algorithms with semantic similarity evaluations, performed through the DeepSeek V2.5 API, to identify and merge semantically similar tables. Features were extracted using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization and Sentence Transformer embeddings. The methodology was assessed against manual groupings provided by domain experts using Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), precision, recall, and F1 score. Similarity thresholds of 0.7, 0.8, and 0.9 were applied to evaluate their impact on table merging performance.
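The abstract does not state how precision, recall, and F1 were computed against the expert groupings. One common convention for comparing two groupings is pair counting: a pair of tables is a true positive when both the method and the experts place the two tables in the same group. The sketch below assumes that convention. The table names, labels, and function names are hypothetical and are not taken from the study.

```python
from itertools import combinations


def same_group_pairs(labels):
    """Return the set of item pairs that share a group label."""
    return {
        (a, b)
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }


def pairwise_prf(predicted, expert):
    """Pair-counting precision, recall, and F1 of a predicted grouping."""
    pred_pairs = same_group_pairs(predicted)
    gold_pairs = same_group_pairs(expert)
    tp = len(pred_pairs & gold_pairs)  # pairs grouped together by both
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical expert grouping and a predicted grouping that splits one group.
expert = {
    "device_2020": "device",
    "device_2021": "device",
    "device_2022": "device",
    "patient_2020": "patient",
    "patient_2021": "patient",
}
predicted = {
    "device_2020": 0,
    "device_2021": 0,
    "device_2022": 1,  # wrongly separated from the other device tables
    "patient_2020": 2,
    "patient_2021": 2,
}

precision, recall, f1 = pairwise_prf(predicted, expert)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this toy case every predicted pair is correct, so precision is perfect, but the split device group costs recall. This is the kind of error a follow-up semantic similarity check between clusters could repair.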
Results Integrating clustering with semantic similarity evaluations raised the F1 score from 0.51 (clustering alone) to 1.00 while using fewer than 1,425 API similarity evaluations. The 113 tables were consolidated into 13–16 table groups, a reduction of 86% to 89%. In addition, clustering decreased the number of table pair comparisons by 77% to 83%. Sentence Transformer embeddings outperformed TF-IDF vectorization in clustering performance, with F1 scores increasing from a range of approximately 0.51–0.87 to 0.51–0.95 in clustering-only scenarios. DeepSeek V2.5 demonstrated the potential to match and quantify subtle semantic differences across the similarity thresholds tested, maintaining high merging accuracy with F1 scores reaching up to 1.00.
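The reduction in pair comparisons follows from comparing tables only within clusters rather than across all 113 tables. The sketch below illustrates the arithmetic. The cluster sizes are hypothetical, chosen only so that they sum to 113, and do not reproduce the clusters found in the study.

```python
from math import comb


def all_pairs(n_tables):
    """Number of unordered table pairs when every table is compared to every other."""
    return comb(n_tables, 2)


def within_cluster_pairs(cluster_sizes):
    """Number of pairs when comparisons are restricted to tables in the same cluster."""
    return sum(comb(size, 2) for size in cluster_sizes)


n_tables = 113
cluster_sizes = [25, 25, 25, 20, 18]  # hypothetical sizes, sum to 113
assert sum(cluster_sizes) == n_tables

exhaustive = all_pairs(n_tables)
restricted = within_cluster_pairs(cluster_sizes)
saving = 1 - restricted / exhaustive

print(f"exhaustive pairs: {exhaustive}")
print(f"within-cluster pairs: {restricted}")
print(f"comparisons avoided: {saving:.1%}")
```

Because the pair count grows quadratically with group size, even a coarse clustering removes most comparisons, and each avoided pair is one fewer similarity call to the API.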
Conclusion The proposed automated dimensionality reduction methodology effectively enhances data quality and analysis efficiency within the MAUDE database. By reducing the number of tables to manageable groups, optimizing context lengths, and leveraging DeepSeek V2.5’s semantic matching capabilities, the framework streamlines data processing and ensures compatibility with advanced analytical tools such as Large Language Models (LLMs). This makes the methodology applicable across various industries, facilitating more efficient and accurate data analysis workflows.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work was supported by the Fund Program for the Scientific Activities of Selected Returned Overseas Professionals in Shanxi Province.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
U.S. Food and Drug Administration (FDA). Manufacturer and User Facility Device Experience (MAUDE). 2024. Accessed November 1, 2024. https://www.fda.gov/medical-devices/mandatory-reporting-requirements-manufacturers-importers-and-device-user-facilities/about-manufacturer-and-user-facility-device-experience-maude-database
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All data produced are available online at https://github.com/leiMizzou/Maude-Schema-Analysis