Abstract
Lifestyle factors (LSFs) are increasingly recognized as instrumental in both the development and control of diseases. Despite their importance, there is a lack of methods to extract relations between LSFs and diseases from the literature, a step necessary to consolidate the currently available knowledge into a structured form. As simple co-occurrence-based relation extraction (RE) approaches are unable to distinguish between the different types of LSF-disease relations, context-aware transformer-based models are required to extract and classify these relations into specific relation types. No comprehensive LSF–disease RE system existed, primarily due to the lack of a suitable corpus for developing it. We present LSD600, the first corpus specifically designed for LSF-disease RE, comprising 600 abstracts with 1900 relations of eight distinct types between 5,027 diseases and 6,930 LSF entities. We evaluated LSD600’s quality by training a RoBERTa model on the corpus, achieving an F-score of 68.5% for the multi-label RE task on the held-out test set. We further validated LSD600 by using the trained model on the two Nutrition-Disease and FoodDisease datasets, where it achieved F-scores of 70.7% and 80.7%, respectively. Building on these performance results, LSD600 and the RE system trained on it can be valuable resources to fill the existing gap in this area and pave the way for downstream applications.
Introduction
Diseases are influenced by both lifestyle and genetic factors (1–4). Substantial evidence suggests that targeted lifestyle interventions can significantly enhance disease prevention and management in precision medicine (5,6). Considering the growing importance of lifestyle factors (LSFs) in disease, attention has been placed towards mining the literature for such relations (7–9). Despite these efforts to extract and centralize scientific knowledge, the focus is almost exclusively on the relation between diseases and nutrition, and the relations between other LSFs and diseases remain largely unconsolidated and scattered across the scientific literature. To mine LSF–disease relations from the scientific literature, the first step is to detect mentions of these entities in text. Then a straightforward approach to identify pairs of related diseases and LSFs is co-occurrence-based Relation Extraction (RE). While this will allow the detection of related entities, the nature of their relation will remain unknown. Although this is a non-issue for gene-disease relations (10), as their nature is mostly causal, the case is completely different when it comes to LSF–disease relations. For example, ‘exposure to radiation’ is well known to be harmful to health (11), causing diseases like skin cancer (12). At the same time, in modern medicine, radiotherapy, which also exposes the human body to radiation, is a common method for treating cancer (13). Therefore, simple RE approaches cannot capture the complexity of LSF–disease relations and it becomes essential to develop more sophisticated approaches that can distinguish between different types of relations — from simple statistical associations (either positive and negative), to more concrete causal, or preventive relations.
The advent of transformer-based language models such as BERT (14) and RoBERTa (15) has revolutionized the field of RE, including domain-specific areas like biomedical RE (16). However, to optimally use these transformer-based models, dedicated corpora for fine-tuning are required. The focus of the BioNLP community regarding RE corpora for disease entities has been primarily on disease–gene (17,18) or disease–chemical (BC5CDR) relations (19). When it comes to LSF–disease relations, only two relevant corpora exist, namely the ND (Nutrition and Disease) corpus (7) and the FoodDisease corpus (9). However, there is a notable lack of comprehensive corpora that cover a broader range of lifestyle factors beyond nutrition and different types of LSF–disease relations, highlighting the need for dedicated datasets for training LSF–disease relation extraction models.
In this study, we introduce LSD600, the first corpus specifically focused on LSF–disease relations. LSD600 consists of 600 abstracts annotated with LSF–disease relations, encompassing 1900 relations covering eight different relation types and nine categories of lifestyle factors (20). We have used LSD600 to train a transformer-based model on the multi-label LSF–disease RE task, which achieved an F-score of 68.5% on our held-out test set. We further validated the model on two independent external corpora of nutrition–disease relations and saw comparable performance. This model represents a first important step in extracting LSF–disease relations from biomedical literature, which can form the basis for knowledge graphs.
Materials and Methods
LSF–disease relation corpus
Document selection for annotation
Given that most of the over 37 million scientific articles available in PubMed do not pertain to LSF–disease relations, this project requires a rigorous selection process to identify relevant abstracts where such relations are expected to appear.
We started with the 200 abstracts from the LSF200 corpus (20), which were pre-annotated with a wide spectrum of LSF named entities. However, since this corpus was compiled to develop NER systems for LSF detection, many did not contain disease mentions or LSF–disease relations.
We therefore selected 400 additional abstracts from PubMed, which were required to mention at least five LSF and five disease mentions, based on using the JensenLab Tagger (https://doi.org/10.1101/067132) and publicly available LSF (20) and disease (10) dictionaries. Considering the uneven distribution of LSFs in the scientific literature (20), with terms related to nutrition being more common than terms from other categories, we selected at least 30 documents for each LSF category. The final corpus consists of 600 abstracts and is named LSD600.
The Relation Types Schema
To capture the spectrum of relations between LSFs and diseases we defined eight different relation types:
Statistical Association: Statistically significant association between two entity types, requiring the existence of an appropriate control group and the implementation of a statistical test.
Positive Statistical Association: Subclass of Statistical Association where a positive effect is clearly stated.
Causes: Subclass of Positive Statistical Association where causality is clearly implied.
Negative Statistical Association: Subclass of Statistical Association where a negative effect is clearly stated.
Controls: Subclass of Negative Statistical Association where a beneficial impact of the LSF on the disease is stated.
Prevents: Subclass of Controls where the LSF hinders the disease from occurring.
Treats: Subclass of Controls where the LSF has a therapeutic effect on the disease.
No Statistical Association: Relations where the absence of statistical association is clearly stated in text.
All relation annotations are non-directional, since in associative relations the direction is undefined, while in relations of impact it is self-evident. For more details and examples on each relation type, please refer to our annotation documentation on Zenodo and Figure 1.
Each line and associated color corresponds to a different type of relation: (1) Statistical Association, (2) Positive Statistical Association, (3) Causes, (4) Negative Statistical Association, (5) Controls, (6) Prevents, (7) Treats, and (8) No Statistical Association. For each relation type, we show an example as it looks in the BRAT RapidAnnotation Tool.
Named entity and relation annotation
We pre-annotated LSD600 with disease and LSF entities using the dictionary-based JensenLab tagger (https://www.biorxiv.org/content/10.1101/067132v1). As LSF200 is a corpus with LSF annotations, we only used the tagger to add disease pre-annotations to those abstracts. All pre-annotations were subsequently manually checked and corrected, and relations between them were manually annotated on all 600 abstracts. We systematically annotated equivalent entities, which is essential for evaluation purposes (21). Figure 1 illustrates entity and relation annotation using an example for each relation type.
Manual annotation process and corpus evaluation
High-quality, consistent annotations are a prerequisite for good performance when fine-tuning deep learning classifiers, hence a clear set of guidelines is crucial (22). Our annotation guidelines were formed and updated during a first round of Inter-Annotator Agreement (IAA) after three researchers individually annotated 30 abstracts and discussed their inconsistencies. A second round of IAA was performed with a fourth annotator, who is an expert on the field of lifestyle factors, to further update and solidify the final guidelines. To evaluate the quality of the guidelines and the corpus, an F1-score was calculated.
We annotated relations across sentences and made sure to include both direct and indirect relations between entity types. For a statistical association to be annotated, it must be statistically significant and the existence of an appropriate control group must be stated. To determine if an association is positive or negative, we considered both textual descriptions and any stated odds ratios, risk ratios, hazard ratios, or coefficient values. Hypothetical statements are not annotated as either.
The BRAT Rapid Annotation Tool (23) was used to perform all the annotations.
Transformer-based relation extraction
Relation extraction pipeline
The transformer-based system employed for relation extraction was adapted from a previously developed binary relation extraction tool, which has proven effective for RE in past applications (24,25). As described in these earlier papers, we frame RE as a multi-label classification task, aiming to determine which relation types (if any) exist for a pair of candidate named entities in the text. The system uses an architecture comprising a pre-trained transformer encoder and a decision layer with a sigmoid activation function. It can leverage pre-trained language models and accepts training, validation, and prediction data in both BRAT standoff and a custom JSON format, supporting extensive hyper-parameter optimization. Evaluation metrics are computed after each training epoch for hyper-parameter tuning. The system is trained for a predefined number of epochs, choosing the model weights that gave the highest F1-score.
As no separate sentence boundary detection is used, the system can train on and predict cross-sentence relations at the document level. To inform the model about which pair of named entities in the text to predict relation labels for, the entities are replaced by “unused” tokens from the model. This masking approach prevents the model from learning based on the actual entities rather than the surrounding context. For additional details on data preprocessing and the pipeline, please refer to (24)
Experimental setup
In our experiments, we partitioned the LSD600 corpus into distinct training, development, and test sets. The system was trained on the training set, and hyperparameters were optimized by grid search to obtain the best F1-score on the development set. The test set was used only once for prediction and evaluation of the best model, after determining the optimal hyper-parameters.
Validation on external corpora
In addition to evaluating our model on the held-out test set from LSD600, we validated it on two external corpora, namely the ND (Nutrition and Disease) corpus (7) and the FoodDisease corpus (9).
An overview of these corpora is provided in Table 1.
The relation types in these corpora are not completely aligned with LSD600. To calculate performance of our model on these corpora, we mapped the relation types of the model to their relation types. The ND corpus uses only Related versus Unrelated labels. We collapsed all subclasses of Statistical Association in our hierarchy into one category, which can be mapped to Related. No Statistical Association along with no extracted relation is considered as Unrelated. For the FoodDisease corpus, the existing relation types are called Cause and Treat; however, these do not correspond to our classes with the same name. Instead, we collapsed all subclasses of Positive Statistical Association and mapped it to Cause. Similarly, we collapsed all subclasses of Negative Statistical Association and mapped it to Treat. Since we do not retrain the model, we evaluate it on the entirety of both corpora.
Results and discussion
Corpus statistics
As its name suggests, LSD600 comprises 600 abstracts annotated with LSF–disease relations. Unlike most existing RE corpora, we do not limit the annotations to intra-sentence relations, with 16% of relations in LSD600 being cross-sentence. This should be compared to the less than 5% observed in several other biomedical RE corpora (24–28). Despite the inherent difficulty of cross-sentence relation annotation, the final IAA F1-score was 82.1%, demonstrating that the annotation quality of LSD600 is high.
The 600 abstracts that make up the corpus were randomly partitioned on the document-level into a training set (60%), a development set (20%), and a held-out test set (20%). The corpus contains a total of 1900 manually annotated relations, which are distributed over eight relation types that are organized into a hierarchy. The breakdown of LSF–disease relations across corpus partitions and relation types is shown in Figure 2. The by far most frequent relation type was Positive Statistical Association (32%), which reflects that the literature primarily focuses on lifestyle risk factors and that it is harder to demonstrate a causal relation.
The left side of the figure shows our relation schema, emphasizing superclasses and subclasses. The bar chart on the right side shows the number of examples for each relation type in the corpus. We distributed the corpus into train:development:test sets in a 60:20:20 ratio, with each bar showing the breakdown for each relation in the different sets.
The 1900 relations in LSD600 involve 5,027 diseases and 6,930 LSF entities. Figure 3 provides an overview of the distribution of relations per abstract, while Figure 4 shows the distributions of disease and LSF mentions per abstract. Additional details are available in Supplementary Tables 2,3 and 4.
The bar chart shows how many abstracts contain how many annotated relations in the training, development, and test sets.
(a) Bar chart showing how many abstracts contain how many disease mentions. (b) Bar chart showing how many abstracts contain how many LSF mentions. Bars are divided into train, development, and test sets.
In 276 abstracts, no relations were annotated. This outcome can be primarily attributed to two factors: First, our guidelines specify that we only annotate relations that provide solid, factual information. Second, there are 138 abstracts with no annotated disease mentions. 137 out of these 138 abstracts originate from the LSF200 NER corpus, which focused on LSFs and did not require any diseases to be mentioned. Only 29 abstracts have no LSF mentions. This too is primarily because of LSF200, in which documents were selected randomly from specific journals. However, a portion of abstracts without entity mentions come from the additional 400 abstracts, not just LSF200. Despite requiring at least five LSF and five disease mentions pre-annotated by the JensenLab Tagger, some abstracts have zero mentions after manual correction of false positives.
Relation extraction
System evaluation
We fine-tuned the RoBERTa-large-PM-M3-Voc-hf model on the development set using a grid search of the parameter combinations in Supplementary Table 5. After each epoch, the evaluation is checked, and only the best performing epoch is reported. The best hyperparameters were Maximum Sequence Length (MSL)=180, Learning Rate (LR)=5e-5, and Maximum number of training epochs=75. This resulted in 75.5% precision, 65.2% recall, and 70.0% F-score on the development set. The best model was used for the final prediction on the held-out test set, yielding 77.3% precision, 61.6% recall, and 68.5% F-score. The performance on individual LSF–disease relation types is shown in Figure 5.
Marker size corresponds to the number of examples in the corpus. The dotted lines represent different F-score contours. The size of each circle represents the number of relations in the corpus, and the color corresponds to each of the eight different relation types.
Manual error analysis
We manually classified the errors made by the best model on the test set in seven categories (Table 2). The most prevalent category of errors is Cross-sentence relations, which refers to any error that crosses sentence boundaries. The model fails to extract 61 of the 77 such relations in the test set, which accounts for more than a quarter of all errors and 40% of the FNs. This problem is not primarily due to the maximum sequence length of the model — which only 8 examples exceed — but is rather due to cross-sentence RE being inherently hard for current models (29,30).
Convoluted text excerpt refers to sentences that are written in a way that makes them challenging to understand also for a human. Co-reference resolution errors are closely related to convoluted text excerpt errors, but refer specifically to cases where determiners (e.g. this, its) cause difficulty in identifying which entities are being discussed. Ambiguous keywords refers to sentences in which the model likely got confused by words/phrases that can have multiple meanings. For example, in the sentence “Dietary management is important for those with celiac disease”, the word “important” does not give an unambiguous clue to the relation type. Rare keyword errors are caused by words/phrases that have a clear meaning but for which there were insufficient training examples for the model to learn them. Model errors comprise wrong predictions that do not belong to any of the categories above. For example, the model could be asserting hypotheses as a fact or simply misinterpreting perfectly comprehensible phrases.
Lastly, we found only 19 annotation errors where manual inspection revealed that the model’s predictions are correct and the corpus annotations were wrong. Correcting these, and recalculating the performance metrics increases the F-score by 2.1%.
Another way to understand the errors made by the model is to look whether different relation types get confused with each other, and if so, which ones get confused. The vast majority (82%) of FNs are not caused by label confusion, but are simply completely missed by the model. Almost half of these are cross-sentence relations. Similarly, 63% of the FPs are relations predicted by the model where there should be no relation.
Relation types being confused with each other accounts for 55 out of 227 (24%) of all errors (FP+FN). The relations more commonly confused are Prevents being predicted as Negative Statistical Association, Statistical Association being predicted as Causes, and Causes being predicted as Positive Statistical Association.
Although accurate relation extraction at the most specific level of granularity is ideal, a RE system is already useful if it can identify whether a relation exists or not and whether it is positive or negative. At this level of granularity, only 8 of the 55 remain and the F-score on the test set improves by 4.9% from 68.5% to 73.4%.
Validation results on external corpora
To further validate the RE system, we extended our evaluation beyond the held-out test set from the LSD600 corpus to include the only publicly available corpora for lifestyle factors: the Nutrition and Disease (ND) corpus and the FoodDisease corpus. Despite being limited to nutrition and not being perfectly aligned with our corpus in terms of relation types, these corpora provided a valuable benchmark for assessing the generalizability of our system.
To use our RE system to make predictions on these corpora, we mapped the relation types from LSD600 to the relation types in each of the two other corpora (see Methods for details). For the ND corpus, the RE system achieved 92.0% precision, 27.3% recall, and 42.2% F-score. To understand the very low recall, we performed an error analysis. Most of the FNs turned out to be relations that should not be annotated according to our guidelines. For example, the ND dataset included hypothetical statements and relations between diseases and metabolic biomarkers. When excluding the FNs that fall outside the scope of our annotation guidelines, the recall is 57.5% and the F-score is 70.7%.
For the FoodDisease corpus, the RE system achieved a precision of 94.3%, and a recall of 54.8%, resulting in an F-score of 69.3%. Similar to the ND corpus, an error analysis was performed to exclude false negatives due to out-of-scope relations. This led to a 70.7% recall and 80.7% F-score.
The excellent precision achieved on both corpora indicates that the relations identified by our system as LSF–disease relations are perfectly accepted by the standards of these other corpora. These performance results are especially promising considering that we did not train our model on these external datasets but used the model trained on LSD600.
Conclusions
In this study, we present LSD600, the first of its kind, manually-curated, LSF–disease corpus for relation extraction. We believe LSD600 will be an invaluable resource for the development and benchmarking of RE methods. In contrast to existing corpora, LSD600 encompasses all aspects of LSFs and annotates eight different relation types between these LSFs and diseases. Our annotations are also not limited to sentence boundaries in the text and include 16% cross-sentence relations. Additionally, we present an LSF–disease RE system trained using the LSD600 corpus, which achieved an F-score of 68.5% on the held-out test set. We further validated the model on external corpora, achieving promising performance. This RE system can be a valuable tool for extracting LSF–disease relations and has many downstream applications, given the crucial role of LSFs in disease onset, development, and management.
Data Availability
The LSF600 corpus and the fine-tuned transformer-based RE system are available under open licenses. The LSF600 corpus, including annotation guidelines, can be accessed on Zenodo at https://zenodo.org/records/12804856. The implementation source code for the RE system is available on GitHub at https://github.com/EsmaeilNourani/LSF_Disease_RE.
Data Availability
The LSF600 corpus and the fine-tuned transformer-based RE system are available under open licenses. The LSF600 corpus, including annotation guidelines, can be accessed on Zenodo at https://zenodo.org/records/12804856. The implementation source code for the RE system is available on GitHub at https://github.com/EsmaeilNourani/LSF_Disease_RE.
Conflict of interest
The authors declare that they have no conflict of interest.
Funding
This work was supported by the Novo Nordisk Foundation [NNF14CC0001, NFF17OC0027594]. K.N. has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie [101023676].