Abstract
Background Adverse social determinants of health (SDoH), or social risk factors, such as food insecurity and housing instability, are known to contribute to poor health outcomes and inequities. Our ability to study these linkages is limited because SDoH information is more frequently documented in free-text clinical notes than in structured data fields. To overcome this challenge, there is a growing push to develop techniques for automated extraction of SDoH. In this study, we explored natural language processing (NLP) and natural language inference (NLI) methods to extract SDoH information from clinical notes of patients with chronic low back pain (cLBP), to enhance future analyses of the associations between SDoH and low back pain outcomes and disparities.
Methods Clinical notes (n=1,576) for patients with cLBP (n=386) were annotated for seven SDoH domains: housing, food, transportation, finances, insurance coverage, marital and partnership status, and other social support, resulting in 626 notes with at least one annotated entity for 364 patients. We additionally labelled pain scores, depression, and anxiety. We used a two-tier taxonomy with these 10 first-level ontological classes and 68 second-level ontological classes. We developed and validated extraction systems based on both rule-based and machine learning approaches. As a rule-based approach, we iteratively configured the clinical Text Analysis and Knowledge Extraction System (cTAKES). We trained two machine learning models (one based on a convolutional neural network (CNN) and one on a RoBERTa transformer), and a hybrid system combining pattern matching and bag-of-words models. Additionally, we evaluated a RoBERTa-based entailment model as an alternative technique for SDoH detection in clinical texts. We used a model previously trained on general-domain data without additional training on our dataset.
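As an illustration of the pattern-matching component mentioned above, the following is a minimal sketch that assigns text spans to first-level classes. The patterns, the `PATTERNS` table, and the `extract_mentions` function are hypothetical examples and are not the rules used in the study.

```python
import re

# Hypothetical patterns for three first-level classes (illustrative only;
# these are not the study's rules).
PATTERNS = {
    "housing": re.compile(r"\b(homeless|shelter|evict(?:ed|ion))\b", re.IGNORECASE),
    "food": re.compile(r"\b(food stamps|food insecurity|food bank)\b", re.IGNORECASE),
    "transportation": re.compile(r"\b(no ride|bus pass|lacks? transportation)\b", re.IGNORECASE),
}


def extract_mentions(note):
    """Return (class, matched text, start, end) tuples sorted by start offset."""
    mentions = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(note):
            mentions.append((label, match.group(0), match.start(), match.end()))
    return sorted(mentions, key=lambda mention: mention[2])


note = "Patient is currently homeless and relies on a food bank."
for mention in extract_mentions(note):
    print(mention)
```

A rule set of this kind tends to be precise on the wordings it covers but misses paraphrases, which is one motivation for combining it with learned models.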
Results Four annotators achieved high agreement (average kappa=95%, F1=91.20%). Annotation frequency varied significantly depending on note type. By tuning cTAKES, we achieved a performance of F1=47.11% for first-level classes. For most classes, the machine learning RoBERTa-based NER model performed better (first-level F1=84.35%) than other models within the internal test dataset. The hybrid system on average performed slightly worse than the RoBERTa NER model (first-level F1=80.27%), while matching or outperforming it in terms of recall. Using an out-of-the-box entailment model, we detected many but not all challenging wordings missed by other models, reaching an average F1 of 76.04%, while matching or outperforming the tested NER models in several classes. Still, the entailment model may be sensitive to hypothesis wording and may require further fine-tuning.
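For readers unfamiliar with the agreement and performance metrics reported above, the sketch below computes Cohen's kappa for a pair of annotators and F1 from true-positive, false-positive, and false-negative counts. The abstract does not state how kappa was aggregated across the four annotators; averaging pairwise values is one common choice and is only an assumption here. The function names and toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumes non-empty, equal-length inputs and expected agreement below 1.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall (assumes tp > 0)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for four text spans (illustrative only, not study data).
annotator_1 = ["housing", "food", "none", "none"]
annotator_2 = ["housing", "none", "none", "none"]
print(round(cohen_kappa(annotator_1, annotator_2), 4))
print(round(f1_score(tp=8, fp=2, fn=2), 4))
```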
Conclusion This study developed a corpus of annotated clinical notes covering a broad spectrum of SDoH classes. This corpus provides a basis for training machine learning models and serves as a benchmark for predictive models for named entity recognition for SDoH and knowledge extraction from clinical texts.
Competing Interest Statement
DL is a shareholder of Crosscope Inc and SynthezAI Corp. BL is supported by an Innovate for Health Data Science Fellowship from Johnson & Johnson. PLA received funding from REAC RAP UCSF through UCSF. EDM received support from Hellman Fellows Fund Payment, Episcopal Health Foundation, and REAC RAP UCSF through UCSF. SP received support from a Back Pain Consortium (BACPAC) grant through UCSF.
Funding Statement
This study was supported by Back Pain Consortium (BACPAC) grant, UCSF Core Center for Patient-centric Mechanistic Phenotyping in Chronic Low Back Pain (UCSF REACH), UCSF Social Interventions Research and Evaluation Network (SIREN), the Innovate for Health program, including the UC Berkeley Institute for Data Science, the UCSF Bakar Computational Health Sciences Institute and Johnson & Johnson (to BL).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The IRB of the University of California, San Francisco gave ethical approval for this work.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
Raw data analyzed in the present study are not available due to privacy concerns. Resulting summary statistics are contained in the manuscript.