TY - JOUR
T1 - A Generalizable Data Assembly Algorithm for Infectious Disease Outbreaks
JF - medRxiv
DO - 10.1101/2021.04.21.21255862
SP - 2021.04.21.21255862
AU - Maimuna S. Majumder
AU - Sherri Rose
Y1 - 2021/01/01
UR - http://medrxiv.org/content/early/2021/04/27/2021.04.21.21255862.abstract
N2 - Background & Objective: During infectious disease outbreaks, health agencies often share text-based information about cases and deaths. Because this information is rarely machine-readable, it creates challenges for outbreak researchers. Here, we introduce a generalizable data assembly algorithm that automatically curates text-based, outbreak-related information and demonstrate its performance across three outbreaks. Methods: After developing an algorithm with regular expressions, we automatically curated data from health agencies via three information sources: formal reports, email newsletters, and Twitter. A validation data set was also curated manually for each outbreak. Findings: When compared against the validation data sets, the overall cumulative missingness and misidentification of the algorithmically curated data were ≤2% and ≤1%, respectively, for all three outbreaks. Conclusions: Within the context of outbreak research, our work successfully addresses the need for generalizable tools that can transform text-based information into machine-readable data across varied information sources and infectious diseases. Competing Interest Statement: The authors have declared no competing interest. Funding Statement: Research reported in this work was supported by the National Institutes of Health through an NIH Director's New Innovator Award DP2-MD012722.
The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: This manuscript uses publicly available data exclusively and thus did not necessitate IRB approval. All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. Please refer to the manuscript for a link to the study's GitHub repository.
ER -