Implementing semantic interoperability of gene expression profiles using the HL7 FHIR standard

The characterization of diseases using high-throughput omics technologies has increasing relevance for individualized treatment decisions. Gene expression profiles in particular can capture significant molecular differences that foster patient stratification and pave the way towards precision medicine. Electronic health records have already evolved to capture genomic data within clinical systems, and standards like FHIR enable sharing within, and even between, institutions. However, FHIR only provides profiles tailored to variations in the molecular sequence. Although expression patterns are neglected in FHIR, they are equally important for decision making. Here we provide an exemplary implementation of gene expression profiles from a microarray analysis of patients with acute myeloid leukemia using an adaption of the FHIR Genomics extension. Our results demonstrate how FHIR resources can be used in bioinformatics-based decision support systems or for the aggregation of molecular genetics data in multi-center clinical trials.


Introduction
Measuring gene expression in patient samples provides more detailed insights into the molecular conditions of the underlying disease. Over the years, high-throughput technologies have evolved to be used in routine clinical diagnostics and foster individualized treatment. At the same time, digitalization in healthcare systems has advanced to the point that electronic health records (EHRs) already capture genomic data. Interoperability and data sharing between systems and institutions are gaining importance, with commonly accepted standards like Fast Healthcare Interoperability Resources (FHIR) [1] as a foundation.
FHIR divides information into modular and extensible components and adopts widely established web standards and the RESTful architecture principle for the sharing of EHRs. Building upon this standard, the FHIR Genomics extension enables the inclusion of genomics data. However, the included profiles are tailored to cover variations in the molecular sequence, while expression patterns are neglected. Moreover, recommendations for representing gene expression results in FHIR are lacking. Nevertheless, these insights are important for decision support and translational research.
Here we provide a feasible FHIR implementation for gene expression profiles from a microarray analysis and demonstrate the interoperability of the resulting FHIR resources with an interactive web application.

Gene expression data
The data set examines a dose-limiting side effect in patients diagnosed with acute myeloid leukemia (AML) who are treated with chemotherapy [2]. In particular, mucositis, i.e. DNA damage within the oral mucosa caused by the chemotherapy, is investigated based on the derived gene expression profiles. The samples were collected as punch buccal biopsies from five AML patients pre- and post-chemotherapy, and from three healthy controls for comparison.
Microarray analysis was performed using the Human Genome U133 Plus 2.0 Array (Affymetrix, Santa Clara, CA) with GRCh38.p13 (Genome Reference Consortium Human Build 38, Ensembl release 99) as a reference, followed by a Robust Multichip Average (RMA) normalization of the raw data. Subsequently, Linear Models for Microarray data (LIMMA) [3] and Significance Analysis of Microarrays (SAM) [4] were applied to identify genes potentially affected by the presence of AML or predictive of oral mucositis after chemotherapy. The authors made the data available at the EBI Expression Atlas [5] portal under the ID E-GEOD-10746.
We chose this gene expression data set because the conducted analysis represents a typical bioinformatics workflow resulting in several gene expression profiles from the same and different individuals that enable disease classification and patient stratification into risk groups [6].

Adaption in FHIR resources
The central element within FHIR to capture real-world concepts in healthcare systems is the Patient resource. The study itself revolves around patient treatment; therefore, all subsequent patient-specific results, and the resources implementing them, refer to this base element. Detailed information about the sample donors was not included in the original data set to preserve the anonymity of the participants. Instead, we used artificial data generated with Synthea™ [7] to create Patient resources as a reference.
The medical condition of the AML patients was captured by the Condition resource to distinguish them from the healthy donors of the control samples. The individual samples are captured by the Specimen resource, which serves as a link to distinguish between samples collected from the same patient, namely pre- and post-chemotherapy.
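As a minimal sketch of this step, the following Python functions build Condition and Specimen resources as plain dictionaries. The function names, the free-text codings, and the use of the `note` element to record the collection phase are assumptions made for illustration; the source does not state which elements or terminologies were used.

```python
def make_condition(patient_id: str) -> dict:
    """Build a Condition marking a patient as diagnosed with AML.

    The free-text code is illustrative; a real implementation would
    likely use a coded terminology, which the source does not specify.
    """
    return {
        "resourceType": "Condition",
        "subject": {"reference": f"Patient/{patient_id}"},
        "code": {"text": "Acute myeloid leukemia"},
    }


def make_specimen(specimen_id: str, patient_id: str, phase: str) -> dict:
    """Build a Specimen linked to its donor.

    `phase` (e.g. "pre-chemotherapy") is stored as a note here; this is
    an assumption, not the documented mapping.
    """
    return {
        "resourceType": "Specimen",
        "id": specimen_id,
        "subject": {"reference": f"Patient/{patient_id}"},
        "type": {"text": "Punch buccal biopsy"},
        "note": [{"text": phase}],
    }


# Two specimens from the same (synthetic) patient, before and after therapy.
pre = make_specimen("spec-1", "pat-1", "pre-chemotherapy")
post = make_specimen("spec-2", "pat-1", "post-chemotherapy")
```

Both specimens reference the same Patient, so only the Specimen identifier separates the two measurements of one donor.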
The gene expression values are generated based on the GRCh38.p13 reference genome and were measured for each sample. Since all gene expression profiles use the same reference database, the individual genes contained in the reference genome were included as MolecularSequence resources. The actual expression values are treated as single measurements that are stored as Observation resources with the Observation-geneticsGene extension referring to the corresponding gene symbol. The mapping details of the data to properties of MolecularSequence and Observation resources are presented in table 1. Although the Patient resource is referenced directly within the Observation resource, the Specimen resource is still required to differentiate between the different samples of the same patient. Using the MolecularSequence resource as a reference avoids redundancy in common genomic information and simplifies the retrieval of the gene expression values for one particular gene across the different samples. An overview of the references between the resources is shown in figure 1.
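The structure described above might look roughly like the following Python sketch of a single expression Observation. The extension URL is the one published for Observation-geneticsGene in FHIR R4; the function name, the use of `derivedFrom` to point at the MolecularSequence, the free-text `code`, and the example values are assumptions, since the exact mapping is given only in table 1.

```python
# Canonical URL of the Observation-geneticsGene extension in FHIR R4.
GENE_EXTENSION_URL = (
    "http://hl7.org/fhir/StructureDefinition/observation-geneticsGene"
)


def make_expression_observation(
    patient_id: str,
    specimen_id: str,
    sequence_id: str,
    gene_symbol: str,
    value: float,
) -> dict:
    """Build one Observation holding a single gene expression value."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"text": "Gene expression value"},  # illustrative coding
        "subject": {"reference": f"Patient/{patient_id}"},
        # The Specimen separates pre- and post-therapy samples of a patient.
        "specimen": {"reference": f"Specimen/{specimen_id}"},
        # Assumed link to the shared gene entry of the reference genome.
        "derivedFrom": [{"reference": f"MolecularSequence/{sequence_id}"}],
        "valueQuantity": {"value": value},
        "extension": [
            {
                "url": GENE_EXTENSION_URL,
                "valueCodeableConcept": {"text": gene_symbol},
            }
        ],
    }


# Hypothetical identifiers and value, for illustration only.
obs = make_expression_observation("pat-1", "spec-1", "seq-42", "TP53", 7.25)
```

Because every Observation for a given gene points at the same MolecularSequence, the gene's metadata is stored once and the per-sample values stay small.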
An in-house installation of the dockerized HAPI FHIR server [8] was used for storing the created resources. We developed a web application that uses the FHIR REST API to retrieve and display the FHIR resources to demonstrate a minimalistic decision support system.
medRxiv preprint, this version posted February 15, 2022; doi: https://doi.org/10.1101/2022.02.11.22270850. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Data and material availability
All necessary software to reproduce the results is publicly available on GitHub at https://frankkramer-lab/gene-expression-on-fhir. This includes scripts to download the data sets from the official platforms, to set up the HAPI FHIR server as a Docker container, to import the data, and to host the web application. Additionally, the Python code for data processing and uploading to the FHIR server, as well as the source code of the web application, is provided within the repository. Our code is licensed under the open-source GNU General Public License Version 3 (GPL-3.0 License), which allows free usage and modification by anyone.
Secure hosting of a public FHIR back end was not possible. Therefore, a demonstration of the web application with hard-coded excerpts of the FHIR resource data is hosted as a static service using GitHub Pages and can be accessed at https://frankkramer-lab.github.io/gene-expresssion-on-fhir.

Results
The original data translated to 252,684 resources stored on our FHIR server. To improve performance, not all gene IDs in the reference genome (60,617 Ensembl entries) were encoded in FHIR, but only those present in the gene expression data. A detailed overview of the created resources and the time requirements is shown in table 2. The web application demonstrates the usage of the created resources: they are obtained directly from the FHIR server, then linked and assembled into a visual representation of the gene expression across the patient samples (figure 1).
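The retrieval and assembly step can be sketched in Python as follows. This is not the web application's code: the base URL, the function names, and the assumption that each Observation carries the gene symbol in the Observation-geneticsGene extension and its value in `valueQuantity` are illustrative. `specimen` and `_count` are standard FHIR search parameters for Observation.

```python
from urllib.parse import urlencode

GENE_EXTENSION_URL = (
    "http://hl7.org/fhir/StructureDefinition/observation-geneticsGene"
)


def search_url(base_url: str, specimen_id: str, count: int = 100) -> str:
    """Build a FHIR search URL for all Observations of one specimen."""
    query = urlencode({"specimen": f"Specimen/{specimen_id}", "_count": count})
    return f"{base_url.rstrip('/')}/Observation?{query}"


def expression_matrix(bundle: dict) -> dict:
    """Pivot a search Bundle into {gene symbol: {specimen ref: value}}."""
    matrix: dict = {}
    for entry in bundle.get("entry", []):
        obs = entry.get("resource", {})
        if obs.get("resourceType") != "Observation":
            continue
        # Pick the gene symbol from the (assumed) geneticsGene extension.
        gene = next(
            (
                ext["valueCodeableConcept"]["text"]
                for ext in obs.get("extension", [])
                if ext.get("url") == GENE_EXTENSION_URL
            ),
            None,
        )
        if gene is None:
            continue
        specimen = obs["specimen"]["reference"]
        matrix.setdefault(gene, {})[specimen] = obs["valueQuantity"]["value"]
    return matrix


def _obs(gene: str, specimen: str, value: float) -> dict:
    """Helper producing a minimal Bundle entry with made-up example data."""
    return {
        "resource": {
            "resourceType": "Observation",
            "specimen": {"reference": specimen},
            "valueQuantity": {"value": value},
            "extension": [
                {"url": GENE_EXTENSION_URL,
                 "valueCodeableConcept": {"text": gene}}
            ],
        }
    }


bundle = {
    "resourceType": "Bundle",
    "entry": [
        _obs("TP53", "Specimen/spec-1", 7.25),
        _obs("TP53", "Specimen/spec-2", 8.5),
    ],
}
matrix = expression_matrix(bundle)
url = search_url("http://localhost:8080/fhir/", "spec-1")
```

The resulting nested mapping has one row per gene and one column per specimen, which is the shape a heat map or bar chart of expression across samples would need.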

Discussion
Through our addition to the FHIR Genomics extension, we were able to include genomic profiling data. Since only excerpts of the molecular data are necessary for detailed investigation, FHIR-encoded gene expression profiles are suitable for use in web-based applications. Furthermore, we were able to demonstrate the integration capabilities of FHIR-encoded genomic profiles in decision support systems. Further improvements could consist of consolidating the outcome of the analyses, e.g. significantly differentially expressed genes between samples, as DiagnosticReport resources.

Conclusion
Our results demonstrate how FHIR resources can be used for the clinical exchange of expression profiles. These resources can also be used in decision support systems or for patient assessment. Further extension of the genomic features in FHIR offers the opportunity to establish the currently missing standard for the aggregation of molecular genetics data in multi-center clinical trials. In this context, stratifying patients across health institutions based on omics profiling remains a complex scenario for conducting multi-center clinical trials.