Abstract
The proliferation of scientific podcasts has generated an extensive repository of audio content, rich in specialized terminology, diverse topics, and expert dialogues. Here, we introduce a computational framework designed to enhance large language models (LLMs) by leveraging this informational content from publicly accessible podcast data across science, technology, engineering, mathematics, and medicine (STEMM) disciplines. This dataset, comprising over 3,700 hours of audio content, was transcribed to generate over 42 million text tokens. Our model, PodGPT, integrates this wealth of complex dialogue found in audio podcasts to improve understanding of natural language nuances, cultural contexts, and scientific and medical knowledge. PodGPT also employs retrieval-augmented generation (RAG) on a vector database built from articles in Creative Commons PubMed Central and The New England Journal of Medicine, enhancing STEMM research and education by providing real-time access to emerging scientific literature. Evaluated across multiple benchmarks, PodGPT demonstrated an average improvement of 3.51 percentage points over standard open-source benchmarks and 3.81 percentage points when augmented with evidence from the RAG pipeline. Moreover, it showed an average improvement of 4.06 percentage points in its zero-shot multilingual transfer ability, effectively generalizing to different linguistic contexts. By harnessing the untapped potential of podcast content, PodGPT advances natural language processing and conversational AI, offering enhanced capabilities for STEMM research and education.
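As an illustration of the retrieval step that RAG relies on, the following minimal sketch ranks stored passages by similarity to a query and prepends the best matches to the prompt. It is not the PodGPT pipeline: the bag-of-words `embed` function stands in for a learned text encoder, the in-memory list stands in for a vector database, and the names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages, k=2):
    """Prepend the retrieved evidence to the question before generation."""
    evidence = retrieve(query, passages, k)
    context = "\n".join(f"- {p}" for p in evidence)
    return f"Evidence:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    # Hypothetical passages standing in for indexed article text.
    corpus = [
        "metformin lowers blood glucose in type 2 diabetes",
        "the mitochondria produce ATP",
        "aspirin inhibits platelet aggregation",
    ]
    print(build_prompt("how does metformin affect blood glucose", corpus, k=1))
```

In a real system the prompt returned by `build_prompt` would be passed to the language model, and the similarity search would run against precomputed dense embeddings rather than word counts.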
Competing Interest Statement
V.B.K. is a co-founder and equity holder of deepPath Inc. and CogniScreen, Inc. He also serves on the scientific advisory board of Altoida Inc. R.A. is a scientific advisor to Signant Health and Novo Nordisk. The remaining authors declare no competing interests.
Funding Statement
National Institutes of Health
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Footnotes
We updated the dataset for model training, which resulted in new findings.
Data Availability
All data produced are available online at https://github.com/vkola-lab/PodGPT.