Natural language processing for scalable feature engineering and ultra-high-dimensional confounding adjustment in healthcare database studies

Richard Wyss; Jie Yang; Sebastian Schneeweiss; Joseph M. Plasek; Li Zhou; Thomas Deramus; Janick G. Weberpals; Kerry Ngan; Theodore N. Tsacogianis; Kueiyu Joshua Lin

doi:10.1101/2025.01.30.25321403

ABSTRACT

Background To improve confounding control in healthcare database studies, data-driven algorithms may empirically identify and adjust for large numbers of pre-exposure variables that indirectly capture information on unmeasured confounding factors (‘proxy’ confounders). Current approaches for high-dimensional proxy adjustment do not leverage free-text notes from electronic health records (EHRs). Unsupervised natural language processing (NLP) technology can scale to generate large numbers of structured features from unstructured notes.

Objective To assess the impact of supplementing claims data analyses with large numbers of NLP-generated features for high-dimensional proxy adjustment.

Methods We linked Medicare claims with EHR data to generate three cohorts comparing different classes of medications on the 6-month risk of cardiovascular outcomes. We used various NLP methods to generate structured features from free-text EHR notes and used LASSO regression to fit several propensity score (PS) models that included different covariate sets as candidate predictors. The covariate sets comprised features generated from claims data only and features generated from claims data plus NLP-generated EHR features.
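As a rough illustration of the PS modelling step, the sketch below fits an L1-penalized (LASSO) logistic regression for treatment assignment by proximal gradient descent. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not report the software, penalty selection, or feature encoding used, so the function names, the fixed `penalty` value, and the toy data (one informative binary feature and one uninformative one) are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def propensity(row, intercept, coefs):
    """Predicted probability of treatment for one patient's feature row."""
    return sigmoid(intercept + sum(c * x for c, x in zip(coefs, row)))


def fit_lasso_ps(rows, treatment, penalty=0.05, step=0.5, iterations=2000):
    """L1-penalized logistic regression via proximal gradient descent.

    The penalty is fixed here for illustration (an assumption); in practice
    it would normally be tuned, for example by cross-validation.
    """
    n, p = len(rows), len(rows[0])
    intercept, coefs = 0.0, [0.0] * p
    for _ in range(iterations):
        residuals = [propensity(r, intercept, coefs) - t
                     for r, t in zip(rows, treatment)]
        intercept -= step * sum(residuals) / n  # intercept is not penalized
        for j in range(p):
            gradient = sum(res * r[j] for res, r in zip(residuals, rows)) / n
            coefs[j] = soft_threshold(coefs[j] - step * gradient, step * penalty)
    return intercept, coefs


# Hypothetical toy cohort: feature 0 predicts treatment, feature 1 does not.
rows, treatment = [], []
for x1 in (0, 1):
    for x2 in (0, 1):
        n_treated = 8 if x1 == 1 else 2
        for k in range(10):
            rows.append([x1, x2])
            treatment.append(1 if k < n_treated else 0)

intercept, coefs = fit_lasso_ps(rows, treatment)
scores = [propensity(r, intercept, coefs) for r in rows]
```

In this toy setting the penalty shrinks the coefficient of the uninformative feature to exactly zero, which is the selection behaviour that makes LASSO attractive when the candidate predictor set is very large.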

Results Including both claims codes and NLP-generated EHR features as candidate predictors improved overall covariate balance, with standardized differences <0.1 for all variables. While overall balance improved, the impact on estimated treatment effects was more nuanced: adjustment for NLP-generated features moved effect estimates further in the expected direction in two of the empirical studies but had no impact in the third.
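For readers unfamiliar with the balance metric, the following sketch computes an absolute standardized difference for a single covariate and compares it with the 0.1 threshold mentioned above. It uses the common unweighted definition (difference in means divided by the pooled standard deviation); the abstract does not say how PS adjustment was applied when balance was assessed, so the function name and the toy values are assumptions for illustration only.

```python
from math import sqrt
from statistics import mean, variance


def standardized_difference(treated, comparator):
    """Absolute standardized mean difference for one covariate."""
    difference = abs(mean(treated) - mean(comparator))
    pooled_sd = sqrt((variance(treated) + variance(comparator)) / 2.0)
    if pooled_sd == 0.0:
        # No variation in either group: balanced only if the means coincide.
        return 0.0 if difference == 0.0 else float("inf")
    return difference / pooled_sd


# Hypothetical indicator covariate (1 = feature present) in each group.
treated = [1, 0, 1, 1, 0, 1, 0, 1]
comparator = [0, 0, 1, 0, 0, 1, 0, 0]

smd = standardized_difference(treated, comparator)
balanced = smd < 0.1  # threshold referenced in the Results
```

In an adjusted analysis the same quantity would be computed within the matched or weighted sample rather than on the raw groups.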

Conclusion Supplementing administrative claims with large numbers of NLP-generated features for ultra-high-dimensional proxy confounder adjustment improved overall covariate balance and may provide a modest benefit in terms of capturing confounder information.

Competing Interest Statement

Dr. Schneeweiss is participating in investigator-initiated grants to the Brigham and Women's Hospital from Boehringer Ingelheim and UCB unrelated to the topic of this study. He is a consultant to Aetion Inc., a software manufacturer of which he owns equity. His interests were declared, reviewed, and approved by the Brigham and Women's Hospital in accordance with their institutional compliance policies. All other authors declare no competing interests for this work.

Funding Statement

This project was funded by NIH R01LM013204; additional funding was provided by PCORI ME-2022C1-25646.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Mass General Brigham (MGB) Institutional Review Board gave ethical approval for this work.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.