Abstract
Objective Information on patient social determinants of health is frequently recorded in unstructured clinical notes, making it inaccessible for researchers and policymakers. We aimed to extract and validate food and housing insecurity status on a large electronic health record-derived patient cohort by combining selective prediction and active learning.
Materials and Methods Manually labeled charts selected via active learning were used to train L1-regularized logistic regression models to identify the presence of food insecurity (N=372, 42% event rate) and housing insecurity (N=559, 36% event rate) in clinical notes. In addition to validating predictions against labeled data, we further validated predictions on an additional unlabeled dataset through associative studies with demographic, clinical, and environmental variables with known associations with food and housing insecurity.
Results The food insecurity model had AUC=0.83, sensitivity=0.90, PPV=0.90, and undetermined rate=0.59 (n=149); the housing insecurity model had AUC=0.81, sensitivity=0.50, PPV=1, and undetermined rate=0.65 (n=224). Out of 4,337 unlabeled patients, the 395 (9%) patients predicted to have food insecurity were more likely to be Hispanic/Latino (48% vs 24%, p<0.001) and have diabetes (34% vs 12%), hypertension (43% vs 11%), and heart disease (12% vs 0.7%) (p<0.001 for all).
Discussion Selective prediction and active learning can facilitate efficient labeling of social determinants of health from unstructured EHR data to identify vulnerable populations and targets for healthcare system and policy intervention.
Conclusion Machine learning can be used to extract high-fidelity information on patient food and housing insecurity status.
Introduction
Social determinants of health (SDOH)—including access to housing, transportation, food, and access to healthcare—significantly influence health outcomes and alter the course of clinical care.1 SDOH factors have been linked to several health outcomes, including increased risk for chronic conditions, hospitalization, and mortality. Healthcare systems have sought to develop SDOH interventions but may struggle to identify patients with SDOH needs.1,2 To address this challenge, the National Academy of Medicine recommended that SDOH factors should be incorporated into Electronic Health Records (EHRs) to allow for improved research, policy, and quality improvement, spurring the development of a wide range of social risk screening tools.3
Despite the increasing use of social needs screens at point-of-care settings, most SDOH factors are captured as unstructured data, such as clinical notes, which can be difficult to analyze quantitatively. Traditional methods to identify SDOH needs include manual chart review, which is time-consuming. Advances in natural language processing (NLP) allow for the development of more efficient, automated and semi-automated approaches to unlock insights from unstructured clinical EHR notes. NLP models to study unstructured SDOH data may inform resource allocation and interventions by healthcare systems and can be used to improve risk stratification, clinical decision support algorithms, and screening tools.
Housing is one such SDOH factor that can significantly affect health, especially for patients with chronic conditions. At the individual level, housing insecurity reduces a person’s ability to care for conditions that require frequent monitoring or medication adherence. Given the well-researched link between housing and health, government and healthcare systems have begun to adopt housing interventions, which are shown to improve health outcomes and reduce costs.4 Housing status is frequently under-captured as structured data in EHR and clinical visits. In a study of a large EHR cohort of 1.2 million patients, approximately 3% of patients (35,000) had housing issues that were captured in unstructured data, compared to 0.19% of patients whose housing needs were in a structured format.5
Food insecurity is another SDOH factor impacting health, particularly through nutritional inadequacy, yet few studies have aimed to quantify the presence in EHR. For example, a recent systematic review of NLP extraction of SDOH variables identified only one paper that used machine learning (ML) methods to assess food insecurity, and this study did not use a naive test set and did not correct for negation or misspellings.6,7
Several chronic conditions are adversely affected by housing and food insecurity status, including coronary artery disease (CAD), hypertension (HTN), Type 2 Diabetes Mellitus (T2DM), hepatitis, and stroke.8–12 The relationship between diet and these chronic medical conditions is well-established. Adult patients who are food insecure, or have limited access to safe, adequately nutritious food, are more likely to have an increased risk for hypertension, with one study reporting a 21% increase in risk. Food insecurity is also associated with an increased prevalence of diabetes, as well as the co-occurrence of multiple chronic conditions.4 Food insecurity is associated with a higher intake of lower nutrition-quality food, thereby leading to higher rates of HTN and CVD. Patients who are housing insecure report poorer physical health outcomes, including hypertension and cardiovascular disease. One hypothesized mechanism is that patients who are housing insecure are less likely to access healthcare and preventive services, given their higher priority housing needs.4
Previous studies have used NLP models to assess SDOH status from clinical notes and have been able to abstract information on substance use, housing status, smoking status, alcohol use, social support, education, and financial needs.7 However, these studies face some important limitations. First, few studies have explored using ML and NLP to extract food insecurity status. Second, existing ML approaches to extract SDOH data do not adequately address variable model accuracy and the expensive process of data labeling. Third, few studies that use NLP to extract SDOH data apply models to unlabeled data and validate the resulting predictions.
This study aims to identify social risk factors (i.e. food insecurity and housing insecurity) in unstructured EHR data and using an NLP model and prospectively predict the presence of social risk in a patient’s EHR. Our approach addresses these limitations by using selective prediction, which gives prediction models to abstain from generating a prediction on uncertain charts, thus minimizing misclassifications; by selecting charts via active learning, which increases the efficiency of data collection by prioritizing charts that the model could most effectively learn from; by applying our model to housing insecurity and food insecurity, an understudied SDOH variable; and by validating model predictions using demographic (gender, race, ethnicity), geographic (zip codes), and clinical (diabetes, hypertension, heart disease, etc.) variables with known associations with food and housing insecurity. Our models can help researchers, providers, and policymakers identify patients with food and housing insecurity at scale.
Methods
Approach
First, manual abstraction of SDOH variables was performed by human abstractors (Figure 1a). Second, natural language processing (NLP) selective prediction models were developed using the data collected by manual abstraction. In contrast to typical ML models that generate a prediction for any input, selective prediction models may abstain from generating a prediction for specific data points.13 Third, we leveraged selective prediction modeling to generate predictions for data points that the model can predict with high accuracy and used active learning via uncertainty sampling to improve the efficiency of manual abstraction. Lastly, fully trained models were applied to unlabeled data and the resulting predictions were combined with manually labeled data to create an SDOH dataset used for downstream analysis.
Diagram depicting the modeling workflow used to extract unstructured variables. EHR data is abstracted, then used to train and test a model that either predicts an outcome or abstains from a prediction. Charts for which a prediction is not generated are passed through an active learning framework via uncertainty sampling. These charts are then labeled and incorporated into the labeled dataset. This process was repeated twice for a total of three successively increasing labeled datasets.
Map of predicted (top) and labeled (bottom) prevalence of food insecurity for Bay Area zip codes. Zip codes with at least 15 patients for the predicted cohort and at least 5 patients for the labeled cohort were included.
Map of predicted (top) and labeled (bottom) prevalence of housing insecurity for Bay Area zip codes. Zip codes with at least 15 patients for the predicted cohort and at least 5 patients for the labeled cohort were included.
Data source
This study used a retrospective analysis of EHR data collected at Stanford Hospital, Stanford, CA, with data including any patient treated at Stanford Hospital from 1998-2022. Ethics approval was granted through Stanford University IRB (#65834).
Study Population
Housing cohort
Patients with mention of the word “unhoused”, “shelter”, “homelessness” or “housing” in a clinical note between November and December 2022 were included. Clinical notes with these phrases were selected as they had a high likelihood of identifying individuals with housing needs. The remaining 1,705 patients were considered for subsequent analyses. Of these, 400 patients were selected for initial manual abstraction. Additional phrases described in the literature for housing insecurity, including “street” or “incarcerated”, led to a significant number of false positives due to improper settings (e.g. “supervise all play on streets/in driveways” or “incarcerated hernia”). We omitted these phrases to conserve model fidelity.
Food insecurity cohort
The food insecurity cohort was created by selecting patients that had a mention of the phrases “hunger”, “not enough food”, “food insec”, “food bank”, “food pantr”, “food stamp” in a clinical note between December 2010 and December 2022. The remaining 4,709 patients were considered for subsequent analyses. Of these, 200 patients were selected for initial manual abstraction.
Outcomes
Food insecurity
Food insecurity was defined as any mention of limited access to nutritionally adequate and safe food, including patients who reported lack of food, consistent hunger, or inability to afford food; or patients receiving food stamps or other financial assistance for obtaining food (e.g. WIC, CalFresh program).14,15 This definition included individuals with both active food insecurity and a past history of food insecurity.
Housing insecurity
Housing insecurity was defined as any mention of inconsistent access to safe housing (e.g. homelessness, unstable housing). This included individuals with both active housing insecurity and a past history of housing insecurity.4
Data abstraction
Of the patients that met the initial inclusion criteria, 200 and 400 patients were randomly selected for manual abstraction for food insecurity and housing insecurity, respectively. First, a chart review was performed to understand how food and housing insecurity was documented in the EHR. Next, a standardized abstraction data entry tool was used in order to refine search terms that were predictive of housing and food insecurity. The data entry tool had a “flag” feature that abstractors could use to indicate charts for which they were not able to make a determination. Lastly, all flagged charts were reviewed and reconciled.
Statistical Analysis
Feature engineering and feature selection
For each variable, modeling was performed at the level of a patient chart. A single chart was randomly selected per patient. For simplicity, we henceforth refer to the unit of analysis as the chart. A randomly selected 60% of charts were used for model training, and the remaining 40% were used for model testing. Features were derived entirely from the free text of clinical notes. Using the text of these documents, TF-IDF (term-frequency-inverse-document-frequency) scores were calculated for unigrams, bigrams, and trigrams. Briefly, TF-IDF is a method used to quantify the relative frequency of an n-gram (an n-word phrase) in a corpus of documents. For example, the TF-IDF score for the unigram “insecurity” in a patient’s clinical report is calculated by dividing the frequency of the word “insecurity” in that patient’s clinical report (term frequency) by the frequency of the word “insecurity” across all clinical reports across all patients (inverse document frequency). L1-regularized logistic regression (Lasso) was used for feature selection and prediction. Hyperparameters (regularization for Lasso and sparsity for document term matrix) were tuned using 10-fold cross-validation. The metric for cross-validation was misclassification cost (described below). Selected features were examined for sensibility and clinical meaningfulness (SI Table 7).
Selective prediction
Selective prediction allows a model to abstain from making a prediction under certain circumstances.13,16 Consider a prediction function, f, that generates a continuous prediction given an input observation x. Let g represent the abstinence function — a function that is 1 if the model generates a prediction and 0 if the model abstains. Let h represent the selective prediction function, defined as
Where the empty set operator denotes the absence of a prediction. We henceforth describe this outcome as “undetermined”. In this study, f is the predicted probability of an L1-regularized logistic regression model and g is defined as follows:
Where p and q represent distinct probability thresholds. Therefore, the resulting selective classifier, c, is defined as:
Following a previously described methodology for selective prediction via utility-based thresholding17, we select p and q to minimize the total cost of misclassification, defined as
where C is cost, N is the number of observations, and the subscripts FP, FN, and U refer to false positives, false negatives, and undetermined, respectively. Here, “cost” refers not to monetary cost but rather real-world utilities. In the example of food insecurity, the cost of a false positive corresponds to the cost of incorrectly labeling a chart as having food insecurity, and the cost of a false negative corresponds to the cost of incorrectly labeling a chart as not having food insecurity. The former results in incorrectly including a patient in the final dataset, whereas the latter results in incorrectly excluding a patient from the final dataset. Following a previously-established regret-based approach, we set the relative costs of FP:FN:undetermined to be 13.5:8:1.17 This means that the model returning a false positive is 13.5 times more costly than abstaining. Similarly, a false negative result is 8 times more costly than the model abstaining.
Model evaluation
Area under the receiver operator curve (AUC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated, treating the abstracted labels as ground truth. For selective prediction models, we calculated undetermined rate, positive undetermined rate, and negative undetermined rate17. Positive undetermined rate is defined as the number of positive observations labeled as undetermined (undetermined positives or UP) over the total number of positive observations (TP + FN + UP). Similarly, the negative undetermined rate is defined as the number of negative observations (UN) labeled as undetermined over the total number of negative observations (TN + FP + UN). The undetermined rate is defined as the number of predictions a selective prediction model abstains from making (UN + UP), over the total number of possible predictions (UN + UP + FN + FP + TN + TP). A selective prediction model with a lower undetermined rate yields greater efficiency gains than a model with a larger undetermined rate and similar levels of accuracy.
After training, the final model was applied to all unlabeled data, yielding three types of predictions: “positive”, “negative” or “undetermined”. For charts labeled positive or negative, the model’s prediction was considered final. Charts labeled “undetermined” were those for which the model abstained from making a definitive prediction.
Active learning via uncertainty sampling
We used uncertainty sampling, an active learning technique, to improve the efficiency of manual abstraction.18 Active learning involves “actively” selecting data points for training such that the model learns faster compared to random selection of training data points.19 The idea is that if N training data points are required to reach a certain model performance using random sampling, fewer than N data points would be required to reach the same model performance using active learning. In uncertainty sampling, training data points that maximize a predefined measure of model uncertainty are prioritized for labeling. The motivation behind this approach is that the model can improve fastest by learning from data points about which it is currently uncertain. After initial model training, we selected the 100 data points whose predicted probabilities were between the two probability thresholds p and q (and therefore undetermined) and furthest away from either threshold. This corresponds to picking the 100 undetermined data points with the highest uncertainty score u, defined as
While it has been correctly pointed out that predicted probabilities are not a measure of model uncertainty, our goal was to help the model learn data points that fall in the undetermined region.20 Therefore, we defined u such that maximizing u would surface data points that fell squarely in the undetermined region. We performed two rounds of active learning following the initial round of model training.
Comparison of demographic and clinical characteristics
We compared relevant demographic and clinical characteristics between labeled and unlabeled datasets to ensure that the data used to develop the model (labeled) was similar to the data that the model was applied to (unlabeled). We also compared demographic and clinical characteristics between outcome groups (presence vs. absence of SDOH variable) in both the labeled and unlabeled (predicted) datasets to compare the actual (labeled) and predicted (unlabeled) relationships between the SDOH variable of interest and patient demographic and clinical characteristics. Demographic variables included patient gender, race, ethnicity, and need for an interpreter during a clinical encounter. Clinical variables were defined as the presence of at least one International Classification of Diseases (ICD) code for diabetes (insert codes), hypertension (insert codes), stroke (insert codes), hepatitis (insert codes), and heart disease (insert codes).
Geographic distribution of SDOH variables
To compare the actual and predicted geographic distribution of SDOH variables, we generated choropleth maps of the San Francisco Bay Area showing the prevalence of SDOH variables by patient zip code. To protect patient confidentiality, only zip codes with at least 5 patients were mapped. Shapefile data was obtained from Data SF.21
Results
Patient characteristics
All three cohorts of food insecure and housing insecure patients are similar for demographic and clinical characteristics (Table 1). The percentage of female patients in the training, test and unlabeled set was 50%, 50%, and 57%, respectively. The distribution across racial groups was similar in all three cohorts. The largest percentage of patients identified as Other (37% in the training set, 36% in the test set, 32% in the unlabeled set), followed by White (36%, 34%, 32%). The distribution across ethnicity was similar in all three cohorts. The largest percentage of patients identified as non-Hispanic (49% in the training set, 44% in the test set, 47% in the unlabeled set). Additionally, a similar percentage of patients needed an interpreter in all three cohorts (22%/23%/19% in train/test/unlabeled). The prevalence of diabetes (18%/17%/19% in train/test/unlabeled), hypertension (24%/27%/21%), heart disease (9.9%/3.4%/4.8%), and stroke (1.3%/2%/1.6%) were comparable in all three cohorts. The training and test cohorts had a similar prevalence of food insecurity (44% in train, 40% in test).
Demographic and clinical characteristics of food and housing insecurity of training, testing, and unlabeled cohorts. Demographic variables include patient gender, race, ethnicity, and need for an interpreter during a clinical encounter. Clinical variables were defined as the presence of at least one International Classification of Diseases (ICD) code for diabetes (insert codes), hypertension (insert codes), stroke (insert codes), hepatitis (insert codes), and heart disease (insert codes).
The percentage of female patients in the training, test and unlabeled set was 52%, 55%, and 60%, respectively. The distribution across racial groups was similar in all three cohorts. The largest proportion of patients identified as White (43% in the training set, 44% in the test set, 46% in the unlabeled set), followed by Other (23%, 21%, 24%). The distribution across ethnicity was similar in all three cohorts. The largest percentage of patients identified as non-Hispanic (79% in the training set, 78% in the test set, 76% in the unlabeled set). Additionally, a similar percentage of patients needed an interpreter in all three cohorts (9%, 4.5%, 7.3% in train/test/unlabeled). The prevalence of diabetes (32%/29%/28% in train/test/unlabeled), hypertension (39%/43%/37%), heart disease (11%/13%/12%), and stroke (6.9%/6.3%/3.7%) were comparable in all three cohorts. The training and test cohorts had a similar prevalence of food insecurity (35% in train, 39% in test).
Effect of active learning on training and test set composition
The proportion of food insecure patients increased in the labeled datasets via the use of active learning (Table 2). In the initial labeled set (Round 1), 29% of patients had food insecurity, compared to 56% and 58% in the patients added via active learning in Rounds 2 and 3, respectively. Similarly, the use of active learning led to increases from Round 1 to Rounds 2 and 3 in the proportion of Hispanic/Latino patients (33% to 37% to 54%), non-English speaking patients (20% to 22% to 28%), and patients with chronic conditions.
Comparison of demographic and clinical characteristics of food and housing insecurity of labeled cohort between initial training and test set (Round 1), patients added after one round of active learning (Round 2) and patients added after a second round of active learning (Round 3).
The proportion of housing insecure patients increased in the labeled datasets via the use of active learning (Table 2). In the initial training set (Round 1), 29% of patients had housing insecurity, compared to 45% and 64% in the added patients in Rounds 2 and 3, respectively. Similarly, the use of active learning led to increases from Round 1 to Rounds 2 and 3 in the proportion of Hispanic/Latino patients (14% to 19% to 27%), patients not requiring an interpreter (92% to 94% to 96%), and patients with chronic conditions.
Model performance statistics
The selective prediction model for food insecurity had an AUC of 0.84, with high sensitivity (0.9), specificity (0.95), PPV (0.9) and NPV (0.95) (Table 3). The overall undetermined rate for predicting food insecurity was 59%, the positive undetermined rate was 67%, and the negative undetermined rate was 54%.
Test set performance of classification models for food and housing insecurity. Note: positive undetermined rate = UP/(TP+FN+UP) and negative undetermined rate = UN/(TN+FP+UN), where UP (undetermined positives) is the number of positive observations labeled as undetermined; TP+FN+UP represents the total number of positive observations in the training dataset; UN (undetermined negatives) is the number of negative observations labeled as undetermined; and TN+FP+UN represents the total number of negative observations in the training dataset).
The selective prediction model for housing insecurity had an AUC of 0.81, with moderate sensitivity (0.5), high specificity (1), high PPV (1) and high NPV (0.89) (Table 3). The overall undetermined rate for predicting food insecurity was 65%, the positive undetermined rate was 82%, and the negative undetermined rate was 54%.
Validation of predicted food insecurity status
The associations between food insecurity and demographic and clinical variables were similar in the unlabeled set compared to the labeled set. In the unlabeled set, patients with predicted food insecurity were more likely to be male (47% vs 41%, p=0.027), to be Hispanic/Latino (48% vs 24%, p<0.001), to need an interpreter (33% vs 11%, p<0.001), to have a higher BMI (23 vs 20, p<0.001), and to have diabetes (34% vs 12%), hypertension (43% vs 11%), heart disease (12% vs 0.7%), and stroke (3.3% vs 0.6%) (p<0.001 for all) (Table 4). This is similar to the labeled set, where patients with predicted food insecurity were more likely to be Hispanic/Latino (46% vs 34%, p<0.001), to need an interpreter (31% vs 16%, p<0.001), to have hypertension (37% vs 17%, p<0.001), and to have heart disease (11% vs 4.2%, p=0.008) (SI Table 2).
Comparison of demographic and clinical characteristics between patients with (TRUE) and without predicted food and housing insecurity in the unlabeled dataset. Patients for whom the model abstained from making a prediction are marked as “Undetermined”.
In the unlabeled set, patients with predicted housing insecurity were more likely to be male (73% vs 34%, p<0.001), to be Hispanic/Latino (60% vs 17%, p<0.001), to need an interpreter (40% vs 6.5%, p<0.001), and to not have diabetes (0% vs 27%, p=0.005) and hypertension (20% vs 31%, p<0.001) (Table 4). In the labeled set, patients with predicted housing insecurity were more likely to be male (59% vs 40%, p<0.001) and have higher BMI (26 vs 24, p=0.003) (SI Table 2).
Overall, the geographic distribution of food and housing insecurity was similar between the predicted dataset (unlabeled) and the labeled dataset. The zip codes with highest prevalence of predicted food insecurity were 95076 (48.8%), 94303 (25%), and 95020 (23.8%) (SI Table 3). In the labeled dataset, the zip codes with highest prevalence of predicted food insecurity were 95023 (60%), 95076 (60%), and 94303 (52.6%) (SI Table 4). Out of the 10 zip codes with the highest prevalence of food insecurity in the labeled set, 9 appeared in the top 10 zip codes with the highest predicted prevalence in the unlabeled set (SI Table 3,4).
The zip codes with highest prevalence of predicted housing insecurity were 95112 (16.7%), 95126 (16.7%), and 94588 (14.3%) (SI Table 5). In the labeled dataset, the zip codes with highest prevalence of predicted food insecurity were 94025 (62.5%), 95125 (55.6%), and 94080 (50.0%) (SI Table 6).
Discussion
Our study demonstrates the feasibility of applying NLP models to unstructured clinical notes to identify patients with food and housing insecurity. We report three main findings. First, our selective prediction models had strong discriminative performance on a naive test set, indicating that they were able to accurately identify patients with food and housing insecurity. Second, the use of active learning via uncertainty sampling enriched the training datasets for data points that the model had trouble learning (instances of food and housing insecurity). Third, the predicted food insecurity status in large, unlabeled datasets yielded associations with demographic, geographic, and clinical variables that aligned both with associations in the labeled dataset, and with associations reported in the literature.
These findings suggest that this model is an accurate and efficient way to obtain SDOH information from an EMR. For example, we found that patients predicted to experience food insecurity were significantly more likely to have diabetes, hypertension, and stroke. This information could be utilized by physicians to inform an optimal treatment plan based on a patient’s specific social risks. Moreover, this data gives healthcare systems and quality improvement analysts a new opportunity to implement specific screenings and programs for the vulnerable populations identified by the model. In addition, little is known about the intersection of multiple SDOH and the impact on health, which this modeling approach could be used to better understand. While this study focused on food and housing insecurity, this model could be applied to numerous other SDOH, including substance use status, exposure to violence, disability status/access needs, transportation barriers, financial strain, among others.22,23 Our model makes previously difficult to quantify data on SDOH available to be addressed, which is an important first step towards achieving health equity.
Selective prediction is appropriate for use cases where not making a prediction is preferable to making an incorrect prediction. We previously showed how selective prediction can be applied for semi-automated cohort selection, where it is preferable to not make a prediction (and have a human manually abstract a chart, for example) than to incorrectly include or exclude a patient from a cohort.17 Here, we found that by not making a prediction on 40-50% of charts (undetermined rate), the model PPV was >95% on the remaining charts. The benefit of selective prediction via utility-based thresholding, as applied here, is that researchers can explicitly define their preferred trade-offs between making a prediction vs. abstaining.
Using active learning allowed us to identify high-yield charts for rapid model development. Active learning has been previously shown to lead to gains in labeling efficiency.24 Notably, both food insecurity and housing insecurity models identified groups of charts for active learning that tended to have higher event rates than the initial training sets. This suggests that the most “uncertain” charts were more likely to be events, and that using active learning helped the models more quickly learn these charts. We can approximate the impact of active learning on abstraction efficiency by comparing the number of events added through active learning with the number of charts required to abstract the same number of events through random sampling. For example, the food insecurity model added 55 + 48 = 103 events out of 182 total charts over the course of two rounds of active learning. Using the 29% event rate in the initial training set, it would have required abstracting 103 / 0.29 = ∼355 charts using random sampling to add the same number of events to the training data. This corresponds to a 51% reduction in the amount of abstraction required after applying active learning.
Our study is unique in its application of NLP models to unlabeled data and subsequent validation approaches. Although we did not abstract ground-truth labels for the unlabeled set, we showed that patients’ predicted SDOH status was correlated with demographic, geographic, and clinical variables in a manner consistent with both the labeled set and prior literature. For example, food insecurity is more prevalent in Black and Latino/Hispanic neighborhoods, which our model predicted.25 Similarly, food insecure adults in the U.S. are two to three times more likely to have diabetes than adults who are food-secure, consistent with our results 34% diabetes among food-insecure patients versus 12% diabetes among food-secure patients as predicted by the model.26,27 Lastly, patients experiencing housing insecurity were more likely to be from East Palo Alto and Salinas, which is consistent with literature showing the highest rates of homelessness in the San Francisco Bay Area are in Santa Clara County, where East Palo Alto and Salinas are located. These relationships suggest that our model-predicted SDOH variables recapitulate meaningful associations with demographic, clinical, and geographic variables.21 We note that the concordance of these associations was stronger for the food insecurity predictions than for housing insecurity prediction, suggesting that the housing insecurity variables may require further validation.
Future work could include accessing longitudinal data to understand chronic disease progression associated with SDOH factors and determining risks for chronic condition exacerbations. This would be particularly useful for clinicians in helping to determine the best treatment options for patients given their SDOH factors. For example, the model could be used to determine the likelihood of a person’s hypertension worsening given that they are food insecure. Additionally, it has been documented that several SDOH factors cluster and are associated with patient demographic characteristics.30 Future applications of the model could identify compound risk for chronic disease progression of multiple social determinants of health.
Limitations
First, our training dataset utilizes a cohort of patients from a single institution for a one month period, which may have limited generalizability. Additional work should look to engage academic and non-academic healthcare institutions or publicly available data sources. Second, our model does not use a lexicon or ontology for identification, which may lead to a more comprehensive set of eligible patient charts. Third, manual abstraction focused on both a history of and active food and housing insecurity, which may have been complicated by uncertainties in clinical note entry (eg. an individual’s use of food stamp programs who does not consider themselves to be food insecure). Additionally, some chart data indicating food and housing insecurity might have been inaccessible because the formatting of the chart made it unreadable. Fourth, our model may be under-estimating the number of patients with food or housing insecurity given existing low-rates of capture for these variables in EMR.28 Since the cohort was limited to Stanford Healthcare, it is possible that the practices to obtain and report SDOH patient information within this system could vary from other healthcare practices. Fifth, we used a function of predicted probabilities from logistic regression for active learning, even though this metric is not strictly a measure of model uncertainty. Sixth, we did not conduct inter-rater reliability analyses for manually abstracted data, although abstractors were trained to follow a consistent definition and document unclear cases. Seventh, because the training dataset was enriched for events through active learning, the event rate in the unlabeled dataset might have been lower, resulting in miscalibrated predicted probabilities. Eighth, zip code data on patients with housing insecurity might not reflect their true zip code of residence as a result of unstable housing. Finally, the validation of chronic conditions was done using ICD codes which previously have been shown to have high specificity, but low sensitivity in determining true disease status.29
Conclusion
Our study indicates that NLP is a feasible strategy to identify food and housing insecurity in patient EMR from clinical notes. Unlike previous studies, our model incorporates active learning and selective prediction, which allows for increased performance and training efficiency. Further, we used a unique tripartite validation strategy, which could be adopted to future studies using NLP to identify unstructured variables in the EMR. Our model showed strong test set accuracy and the predicted SDOH variables recapitulated known associations with demographic, geographic and clinical variables. We demonstrate this NLP approach to extract SDOH from patient EMR can successfully identify vulnerable populations and their specific risks, informing areas for healthcare system and policy intervention. Future studies should explore how NLP can be used to assess SDOH factors and risks for chronic condition progression and exacerbations, as well as the health effects of interventions to address SDOH needs.
Data Availability
All data needed to evaluate the conclusions are present in the paper and in the Supplementary Materials. The datasets generated analyzed during the current study are not publicly available due to patient privacy but are available from the corresponding author (JHC) on reasonable request.
Supplementary Information
Performance metrics of test set for rounds 1, 2, 3 in predicting whether a patient is food insecure.
Comparison of demographic and clinical characteristics between patients with and without food insecurity in the labeled dataset (Round 3).
This table contains the top 10 zip codes with the highest proportion of food insecurity produced by the unlabeled dataset. Zip codes are listed in descending order with the first zipcode as the highest proportion of food insecurity.
This table contains the top 10 zip codes with the highest proportion of food insecurity produced by the labeled dataset. Zip codes are listed in descending order with the first zipcode as the highest proportion of food insecurity.
This table contains the top 7 zip codes with the highest proportion of housing insecurity produced by the unlabeled dataset. Zip codes are listed in descending order with the first zipcode as the highest proportion of housing insecurity.
This table contains the top 10 zip codes with the highest proportion of housing insecurity produced by the labeled dataset. Zip codes are listed in descending order with the first zipcode as the highest proportion of housing insecurity.
Features and corresponding coefficients selected by the final food insecurity and housing insecurity L1-regularized logistic regression models.
Footnotes
Email: akshay325{at}gmail.com
Email: wkumar2{at}stanford.edu
Email: ivlopez{at}stanford.edu
Email: edwardt3{at}stanford.edu
Email: ujwal{at}stanford.edu
Email: wwang28{at}stanford.edu
Email: mclary{at}stanford.edu
Email: wvd{at}stanford.edu
Email: jgshaw{at}stanford.edu
Email: ogevaert{at}stanford.edu
Email: jonc101{at}stanford.edu
↵* Indicates equal contributions
Funding Statement: No funding was received.
COI Statement: AS reports stock ownership in Roche (RHHVF).
JHC reports royalties from Reaction Explorer LLC; consulting fees from National Institute of Drug Abuse Clinical Trials Network, Tuolc Inc, Roche Inc; and payment for expert testimony from Younker Hyde MacFarlane PLLC and Sutton Pierce.
Final approval: All authors
Ethics approval: Ethics approval was granted through Stanford University IRB (#65834).