A Systems Biology Workflow to Support the Diagnosis of Pyrimidine and Urea Cycle Disorders

Background: Pyrimidine (and purine) metabolism provides essential high-energy vehicles, universal throughout all life, which carry both activated metabolites that serve as fuel and building blocks for countless metabolic processes as well as the messenger molecules that steer these processes. Furthermore, these nitrogen-containing structures encode the alphabet needed to construct the genetic code. The pyrimidine pathway overlaps with the Urea Cycle through a common metabolite (carbamoyl phosphate), with the urea cycle being responsible for the production of several amino acids and removing ammonia. Various Inherited Metabolic Disorders (IMDs) disrupt these pathways, with considerably large phenotypic variation in affected patients, hindering diagnosis. Moreover, the biomarkers for these disorders overlap substantially between the IMDs and their underlying biological pathways, interfering with identifying the deficient protein. We developed a framework that allows us to combine clinical and theoretical biomarkers with pathway models through network approaches and semantic web technologies to support the diagnosis of pyrimidine and urea cycle IMDs. Methods: We integrated literature and expert knowledge into machine-readable pathway models for pyrimidine and urea cycle disorders, including disease information and relevant downstream biomarkers. The theoretical change for each biomarker per disease was compiled based on a manual database search. The top three pathways of interest were retrieved through semantic web technologies, by selecting pathways that covered most unique markers as well as showing overlap with the theoretical marker data. These pathways and the corresponding clinical data were visualized through network analysis. Two expert laboratory scientists evaluated our approach by diagnosing a cohort of sixteen previously diagnosed patients with various pyrimidine and urea cycle disorders, based on a visualization of theoretical biomarker overlap with patient data, as well as top three pathways displaying the relationships between the individual biomarkers. Results: The number of relevant biomarkers for each patient varies greatly (five to 48), and likewise the pathways covering most unique biomarkers differ for equivalent disorders. The two experts reached similar conclusions regarding the diagnosis of nine patient samples without knowledge about clinical symptoms or sex. For the remaining seven cases, four interpretations pointed in the direction of a subset of disorders, which could be prioritized for further investigation. Three cases were found to be undiagnosable with the data available. Conclusion: The presented workflow supports the diagnosis of several IMDs of pyrimidine and urea cycle pathways, by directly linking biological pathway knowledge and theoretical biomarker data to clinical cases. This workflow is adaptable to analyze different types of IMDs, difficult patient cases and functional assays in the future. Furthermore, the pathway models can be used as a basis to perform various other types of (omics) data analysis, e.g. transcriptomics, metabolomics, fluxomics. Data Availability: The data used in this study have been deposited at https://github.com/BiGCAT-UM/IMD-PUPY .


Introduction
Many enzymes are critically involved in the synthesis, degradation, and transport of molecules implicated in metabolic processes [1]. Consequently, genetic disorders leading to deficiencies and malfunctioning of these enzymes can result in a lack of or (potentially) toxic levels of metabolites. In addition, inherited enzymatic disturbances can affect other sections of the pathway(s) in which the protein is active [2]. Collectively, these diseases are called Inherited Metabolic Disorders (IMDs) or Inborn Errors of Metabolism [3]. The limited therapies which are available [4] can only be initiated after a timely and accurate diagnosis. Currently, the diagnoses of IMDs are based on both symptoms and biomarkers, which are measured in various bodily fluids. Methods to detect genetic variants such as Whole Exome Sequencing are used to diagnose IMDs as well, although so far have been found less sensitive and less specific as compared to metabolic measurements [5]. Furthermore, genetic profiles of patients can also contain Variants of Uncertain Significance, which can only be classified as (likely) pathogenic when genomic, transcriptomic, proteomic, metabolomic, and fluxomic data are integrated through pathway or network analysis [6][7][8]. Unfortunately, diagnosing IMDs using these approaches is challenging due to the commonly observed overlap between their biomarkers, since the individual compounds are often involved in more than one pathway and can therefore be metabolized to various products. This issue is especially relevant for pyrimidine disorders, which often present with nonspecific clinical symptoms and a lack of a clear genotype-phenotype correlation [9][10][11]. The symptoms related to urea cycle disorders are often more specific (e.g. hyperammonemia, lethargy, vomiting, coma) [12]. The pyrimidine pathway synthesizes thymine, cytosine and uracil, DNA and RNA building blocks, and other molecules pertinent to signal transduction and energy transport. The urea cycle removes ammonia from the body and is a direct upstream pathway for pyrimidine metabolism, with several metabolites converted to amino acids, mainly taking place in the liver. This overlap leads to elevated metabolic levels in both pathways for certain IMDs [13]. Even though recent advances in clinical biomarker measurements [14] in urine samples have aided in diagnosis for some IMDs, most markers are currently not used for newborn (dried blood spot) screening [15], due to limited detectability of the biomarkers in these spots and with most pyrimidine IMDs missing treatment options.
This paper provides a framework using Systems Biology approaches to integrate data related to pyrimidine and urea cycle IMDs, including metabolic reference and patient values, theoretical biomarkers, and biological pathway information. Semantic web technologies are used to accurately retrieve relevant pathways for network data visualization, providing an overview of the processes relevant to the patient-specific deficient protein. With this approach, the attention progresses from individual markers to changes at the process level, which enables linking biological pathway knowledge to clinical cases. This direct link shows which metabolic reactions are disturbed, which proteins are related to these reactions, and potentially which specific protein is impaired, aiding diagnosis. Furthermore, metabolic disturbances can be recognized which cannot be attributed directly to the disorder, revealing potential blind spots in existing clinical knowledge. The presented workflow also highlights chances for the IMD field as a whole, by showcasing that improving data and identifier harmonization increases machine readability and integration of clinical data with pathway knowledge and biomarker information. Furthermore, the automation of the workflow could aid in the diagnostic process of IMDs and is adaptable to analyze different types of IMDs and functional assays in the future. Figure 1 shows the proposed workflow to connect clinical data to pathway models and theoretical biomarker data, using semantic web and network analysis. The information captured through the pathway models was pivotal in connecting clinical data with biomarker data. Several data harmonization steps were needed, before arriving at the final data interpretation step. Also, knowledge from various databases had to be integrated within the workflow, which are summarized in Table 1. All data processing steps which could be automated were captured in an RMarkdown script [16] in the R programming language [17] (version 4.1.0), tested through RStudio [18] (version 1.4.1717), available at https://github.com/BiGCAT-UM/IMD-PUPY .  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint chromatography with the mass analysis capabilities of mass spectrometry.

Workflow
PathVisio [22] PathVisio is a free open-source pathway analysis and drawing software that allows drawing, editing, and analyzing biological pathways. 3 UniProt [23] UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. 3 Rhea [24] Rhea is an expert-curated knowledge base of chemical and transport reactions of biological interest -and the standard for enzyme and transporter annotation in UniProtKB.

OMIM [25]
Online Mendelian Inheritance in Man (OMIM) is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available. 3 WikiPathways [26] WikiPathways is a database of biological pathways maintained by and for the scientific community.
WikiPathways SPARQL Endpoint [27] The semantic web format of the WikiPathways database, using the Resource Description Framework (RDF) as data format which can be queried through the SPARQL-query language.

5
HMDB [29] The Human Metabolome Database (HMDB) is a freely available electronic database containing detailed information about small molecule metabolites found in the human body.

5, 6
Cytoscape (REST API) [30,31] Cytoscape is an open-source software platform for visualizing complex networks and integrating these with any type of attribute data. cyREST is a language-agnostic, programmer-friendly RESTful API module for Cytoscape, allowing for programmatic access to Cytoscape features.

8
BridgeDb (Cytoscape App and webservice) [30,32] BridgeDb is a framework to map identifiers between various biological databases, built into other tools such as the Cytoscape app, and available through a webservice.

KEGG [33]
KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism, and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.

Clinical Biomarker Data
All patient data were collected through two targeted chemical assays in urine -one for purine and pyrimidine (PUPY) and one for amino acids (AA) -as described previously [14,21], and were reported in μmol/mmol creatinine. All reference metabolic data were collected from patients suspected of having an IMD, with no apparent IMD as assessed by selective metabolic screening. For analysis of PUPY, 4853 samples were selected over ten years, and for AA analysis 1872 samples were selected over five years; these values characterize the 95% interval limits describing the lower and upper limit of distribution in a control population. The metabolites in the reference data were annotated with corresponding ChEBI [19] identifiers (IDs) or Wikidata [20] IDs (when no ChEBI ID was available), see Data/Reference_values_PUPY.txt and Data/Reference_values_Urea.txt on GitHub. The reference data was categorized based on the following age ranges: 0 up to 1 year, 1 up to 5 years, 5 up to 16 years, and over 16 years old. The reference data from the overarching category 0 to 16 years was used for several PUPY biomarkers if no reference value was available for a specific age category (Data/Reference_values_PUPY_noNAs.txt). Patient data was anonymized and data regarding treatment (allopurinol, which is metabolized in the liver to the active form oxypurinol) was removed from the dataset, as well as argininosuccinic acid anhydride (ASA-anhydride) which became obsolete after switching the separation method from anion exchange chromatography to UHPLC-MS/MS for AA analysis [21]. Four patients were disregarded from this study, due to missing data for the AA panel (patients labeled B, C, P and Q). Because most IMD disorders are quite rare, the collected data did not include a large sample size regarding ethnicity, age range or sex, and can therefore not be regarded as representative for the worldwide patient population. Table 2 details the sample size, diseases, and corresponding age ranges used in this study. No patients have objected to the anonymous use of their left-over material from routine diagnostics for laboratory development and validation purposes. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Pathway models
Machine-readable versions of the purine, pyrimidine, and urea cycle metabolic pathways were created using the pathway editor and curation tool PathVisio (version 3.3.0) [22], as well as pathway models (PWMs) on biomarkers, visualizing several markers missing from the main pathway models. All proteins were annotated with UniProt IDs [23], which linked to potentially relevant Rhea IDs [24] for the metabolic conversion interactions; the Rhea IDs were manually curated regarding the directionality of the reaction and incorporating detailed stereochemistry on substrate and product of the reactions if available. Corresponding ChEBI IDs [19] in Rhea were used to annotate the metabolites. IMDs were annotated with OMIM disease IDs [25]. Data on the created PWMs was deposited in WikiPathways [26] and retrieved from the WikiPathways SPARQL endpoint [34] (data from September 2021 [35]). This endpoint contains PWM data in a harmonized and semantic format, the Resource Description Framework (RDF) [27].

Selection of Relevant Biomarkers
For each patient, measured biomarkers from the clinical assay data (step 2.2) were annotated automatically with the ChEBI IDs from the reference data by character vector matching, including the age-relevant reference values. Missing biomarker data (null-values) were disregarded, as well as patient or reference concentration data being equal to zero. Relevant biomarkers were detected by calculating how far the measured patient value was removed from either the lower or upper reference limits; below the lower limit indicates a negative change or decrease (values between 0 and 1) and above the upper limit indicates an increase or a positive change (values above 1). When a patient value was in between or exactly equal to the reference values, the change was designated to be one. All resulting calculated change-values were log(2) transformed to show proportional changes.

Theoretical Biomarker Data
Since the coverage of biomarkers in chemical assays was highly dependent on the chemical analysis used, all PUPY IMDs were queried through the WikiPathways SPARQL endpoint by their OMIM IDs and HGNC gene name [28] first. Second, the corresponding potentially relevant biomarkers for these disorders were retrieved manually from IEMbase is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 4, 2022. ; IDs [29], and positive or negative concentration change. The latter was converted to a numeric scale for each provided age category: neonatal (birth-1mth), infancy (1-18mths), childhood (1.5-11yrs), adolescence (11-16yrs), and adulthood (>16yrs). If a disorder from the PWMs could not be linked to an HGNC gene name, a manual inspection was performed to include the full name and abbreviation of said disorder in the search criteria. The biomarkers were retrieved as HMDB IDs, with their reported positive or negative changed concentration information, and unified to the HMDB 4.0 structure. Correlations between individual metabolic biomarkers and diseases were visualized in a heatmap (Euclidean distance) with the gplots package (version 3.1.1, https://cran.r-project.org/package=gplots); positively changed biomarkers were colored red (using three shades to show mildly, high, and very high), negatively changed markers blue (again three shades), and markers which were not altered for a disease were colored white.

Relevant Biomarker overlap
All biomarkers were mapped manually from ChEBI IDs to their corresponding HMDB IDs to match with the theoretical biomarker data; L enantiomers were chosen over D enantiomers for the annotations when this was unclear from the metabolite name. The changed biomarkers (step 2.4) relevant for each patient were first converted to the same scale as the theoretical biomarkers (values for the log(2) change above 3 or below -3 were set at 3 and -3, respectively). Second, the patient data was added to the theoretical biomarker visualization (step 2.5) by matching on HMDB IDs, and visualized in a heatmap (with the same settings as step 2.5).

Pathway Selection
Relevant pathways for each patient were found through another semantic web query against the WikiPathways SPARQL endpoint (step 2.3), matching the remaining changed biomarkers (2.4), which have a log(2) change value not equal to zero. Relevant biomarkers which were not in any pathway model were reported. The relevant pathways were sorted based on the highest number of matching biomarkers. In total three pathways were automatically selected, based on including most unique biomarkers.

Data visualization
The data for each patient was visualized with the network analysis tool Cytoscape [37] (version 3.8.2), by using the Cytoscape REST API [31] (version v1) through R. The absolute highest value for the log(2)change was used to determine the color scale, using a five-point scale to accommodate for small changes (values between -1.5 and 1.5) and high (abnormal) biomarker values. If no value was available for a node within the network, the fill color was set to gray. The patient data were linked to the IDs in the network by matching on the "XrefID" column in Cytoscape. When this resulted in no matches, the BridgeDb App in Cytoscape [30] was used to convert the XrefID column (from either HMDB [29] or KEGG compound [33] IDs) to corresponding ChEBI IDs, relying on the BridgeDb web service for the metabolite ID matching [38].

Data Interpretation
For each patient, a network data visualization (step 2.8) and relevant biomarker overlap heatmap (step 2.8) were provided to two Laboratory Specialists Biochemical Genetics, after . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 4, 2022. which narrative feedback on a potential diagnosis was collected. For each patient, a Cytoscape session file (step 2.8) is available on GitHub (Cytoscape_networks folder), as well as an interpretation overview containing patient characteristics containing age category, relevant biomarkers, personalized heatmap (step 2.6), selected pathways, and clinical interpretation (Expert_interpretation folder).

Workflow
A systems biology workflow was designed to connect clinical biomarker data for IMDs to pathway models. The biomarker data from several patients were visualized through network data analysis, which will be discussed in more detail below.

Clinical Biomarker Data
Data from 16 patients with a variety of pyrimidine and urea cycle IMDs were considered to test our workflow. In total 88 markers were measured, 34 through the PUPY panel and 54 by the AA panel. Out of these biomarkers, two could not be annotated with one unified database IDs from ChEBI. The first biomarker, CysHCys, could not be annotated with one unique ID; the second marker (2,8-dihydroxyadenine) could not be found in the ChEBI database and was therefore annotated with a Wikidata ID. Treatment compounds (oxypurinol and allopurinol) or compounds not measured by the panel anymore (argininosuccinic acid anhydride) were removed, which leaves 83 potential biomarkers. Due to missing reference data, the markers n-carbamyl-aspartate (CHEBI:32814), allantoin (CHEBI:15676), cytosine (CHEBI:16040), and cytidine (CHEBI:17562) were disregarded for further data analysis. Table 3 shows the number of remaining biomarkers for each patient after connecting the data to reference data, as well as the number of altered biomarkers per patient. This table further details how many biomarkers were missing from any PWM (step 2.4), the pathway's coverage of biomarkers relevant for each patient (pathway), and the highest total of markers covered by one pathway. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Pathways Models
Since the chemical assay for pyrimidine metabolites also measures purine compounds, we created PWMs for both, as well as for the urea cycle. The final PUPY and urea cycle IMD models have been distributed through WikiPathways [26] with IDs wikipathways:WP4224 (purine), wikipathways:WP4225 (pyrimidine), and wikipathways:WP4571 (urea cycle), the latter was later extended with associated pathway information and diseases (wikipathways:WP4595). Last, two biomarker data pathways were developed, pyrimidine metabolism disorders (wikipathways:WP4584) and urea cycle disorders (wikipathways:WP4583); the biomarkers were connected to at least one reaction, either as substrate or product. Table 4 describes the content of each pathway, regarding proteins, metabolites, interactions, and described disorders. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Selection of Relevant Biomarkers
The annotated data from step 2.2 ( Figure 1) was used to find relevant pathways for visualization, which led to 171 pathways in total, including one or more distinct biomarkers. Fifteen of these pathways contain 10 or more markers, which are displayed in  Table 3, as well as the number of biomarkers captured by the highest-scoring PWM. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Theoretical Biomarker Data
Since the chemical assay for pyrimidine metabolites also measures purine compounds, we collected theoretical biomarker data for both pathways, in addition to the urea cycle. In total, 17 unique purine, 10 pyrimidine, and 8 urea cycle IMDs were included in this study (a total of 35 disorders). For these three classes of IMDs, 27 unique biomarkers were found to be relevant for the sample matrix urine. Taking available reference data into account and the availability of an HMDB ID, a maximum of 22 unique biomarkers could be linked to 23 IMDs; Figure 2 shows an example visualization for the age category 0 up to 1 year old, which correlates the disorders based on their theoretical biomarkers change, sorted on their respective IMD class. For the following IMDs, no data was available: TPMT, PRPPS, TYMP, TK2, RRM2B, ITPA, IMPDH1, DHODH, AMPD1. Three disorders in the urea cycle category -CPS1, ARG1, NAGS -and one IMD in the purine pathway -DGUOK -were indistinguishable based on the retrieved biomarkers from IEMBase. Two additional disorders -SLC25A13 (urea cycle) and NT5C3A (pyrimidine) -were the closest match to these previous four indistinguishable IMDs. The group of pyrimidine IMDs (AGXT2, UMPS, DPYD, DPYS, UBP1), urea cycle (ASL, ASS1) and purine (ADA, APRT, ADSL, ATIC) contained some discriminating values for the biomarkers, enabling distinction between these individual disorders based on only biochemical markers in theory. OTC and SLC25A15 (both urea cycle disorders) can be difficult to differentiate from one another based on theoretical biomarkers. A group of purine IMDs -XAN2, XO, PNP, HPRT1 (with two known phenotypes), PRSP1 -can be considered control cases, since these too should allow for distinction between IMDs based on theoretically relevant biomarkers. Figure 2 also shows biomarker data for patient I (bottom row in purple), who was diagnosed with DPYD, and how this data correlates with theoretical biochemical markers in urine for individual IMDs in the purine, pyrimidine, and urea cycle pathways. The sample obtained from this patient showed three additionally changed biomarkers as compared to the theoretical values, as well as a distinct correlation to the theoretical biomarker profile of DPYD.

Pathway Selection
By only querying the pathway data for relevant biomarkers (instead of all markers in a panel), a customized visualization was possible for each patient. The selection of the top three pathways containing the highest number of unique markers was performed automated. However, this selection can also be performed manually (for example by an expert with knowledge on relevant pathways), aiming to include pathways containing biomarkers with the largest change. Table 3 shows how many pathways were relevant for each patient, as . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 4, 2022. ; well as how many unique biomarkers the highest-scoring pathway covers. Table 6 shows which pathways ended up in the top three for each patient, as well as which biomarkers were not part of the visualization after selecting the top three. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 4, 2022.   Figure 3 shows the network visualization for patient I on the 'Biomarkers for pyrimidine metabolism disorders' (left) and 'Purine metabolism' (right). Two metabolites (thymine, uracil) which are directly converted by DPD show elevated levels; one direct downstream metabolite of thymine (5-OH-methyluracil) also shows a high concentration. Two downstream metabolites of DPD (dihydrouracil, beta-alanine) are found to be within the healthy reference values, whereas (S)-beta-aminoisobutyrate shows a decreased concentration. Last, the elevated level of SAICA-riboside was not expected for this disorder and might suggest immaturity of the patient.

Data Visualization
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 4, 2022. ;

Data Interpretation
The two laboratory specialists in biochemical genetics combined the heatmap from step 3.6 and the network biomarker data visualization from step 3.8 to arrive at an IMD diagnosis. Nine disorders out of the 16 patient samples were diagnosed with the correct IMD, whereas for four patients the visualization helped to suggest further informative assays (the last column of Table 6 summarizes these results). These latter patients included one case of Dihydropyrimidine dehydrogenase deficiency (DPD -patient E), and three cases of Ornithine Transcarbamylase deficiency (OTC -patients G, R and S). Samples from patients under treatment, e.g. patient H (diagnosed with hyperornithinemiahyperammonemia-homocitrullinuria (HHH) syndrome, also known as ornithine translocase (SLC25A15) deficiency) receiving citrulline, were difficult to diagnose since the workflow did not distinguish between abnormal biomarker values due to treatment or caused by the IMD. Patient J (diagnosed with Beta-ureidopropionase deficiency, UPB1D) was not correctly diagnosed by both experts, which we attribute to the very mild disturbances in the biomarker patterns. Last, patient O (diagnosed with Ornithine Transcarbamylase deficiency, OTC), showed an unexpected higher value for arginine rather than ornithine.

Workflow
Our proposed workflow was based on combining several data analysis techniques, e.g. semantic web and network analysis. Together, these techniques created the possibility to visualize clinical biochemical assay data on the pathway level, allowing for a more detailed interpretation of the connections between the different markers. The developed workflow . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 4, 2022. ; was designed for individual patient analysis and optimized for PUPY and urea cycle disorders with biomarkers measured through targeted assays. We believe this workflow can be used as an inspiration to construct a similar system for other IMDs, however, the following limitations should be taken into account for each step of the workflow.

Clinical Biomarker Data
First, all chemicals in the assays were annotated manually based on their (Dutch) name; using persistent identifiers to annotate data is a key aspect to enable open science [39], and ultimately lead to FAIR data [40]. This process led to two biomarkers differentiating in annotation. The first, CysHCys, is a disulfide from cysteine and homocysteine; annotating this compound was not possible based on the name provided. Drawing the chemical structure and converting the structure to a SMILES [41] could be used to annotate the compound. This conversion to SMILES could be performed in several chemical drawing programs, out of which CDK Depict (https://github.com/cdk/depict) is a free web-based version implementing the Chemistry Development Kit [42], an open-source cheminformatics toolkit. The second biomarker, 2,8-dihydroxyadenine, could not be located in the database we used for harmonization (ChEBI), however could be annotated with a Wikidata ID. We would recommend annotating chemical biomarkers with a database reference along with the name of the compound by laboratories and vendors of instruments alike. Second, since we combined data from two clinical assays (which is common practice in Clinical Genetics laboratories when a patient is suspected of having an IMD with nonspecific symptoms), we were also faced with different age categories for each assay. A solution would be to (re)structure data to allow for comparing all assays available in a lab [43]. Third, not each age category could be linked to reference data, therefore we had to substitute missing values with data from a broader age range. Several biomarkers have not been measured in enough non-patients to calculate a reliable healthy concentration range; reference data were missing for n-carbamoyl-aspartate, allantoin, cytosine, and cytidine. The first biomarker (known in HMDB [29] as ureidosuccinic acid) only has reference data available for saliva samples. The latter three compounds all have reference data available in HMDB for urine samples, however, these values were partly derived from NMR measurements, which are not directly comparable to the tandem mass spectrometer data used in this study.

Pathways Models
Creating PWMs annotated with resolvable identifiers for the entities within the pathway [44] were crucial for data analysis and visualization [45]. The biomarker data models however only incorporate the markers as substrates or reaction products, without showing the metabolites more than one step removed from the original marker. Several biomarkers of PUPY and urea cycle IMDs are numerous steps removed from the original malfunctioning protein; covering all steps up to the original affected reaction will differ for each disorder and each marker. Also, this reaction information is scattered over publications in images and text [46] as well as various databases [47], which requires dedicated curation time to arrive at a PWM covering all relevant interactions. Several initiatives merge pathway information [48,49] based on gene and protein content, rather than metabolite and chemical conversion level. Lacking automatable access to pathway content or licenses prohibiting data extraction were another hurdle in acquiring all relevant interactions for biomarkers. Last, completing the . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 4, 2022. ; models presented in this study was a ponderous task, due to the lack of naming standardization in databases and papers alike, as also has been noted in systems pharmacology approaches [50]. This despite the IUPAC-nomenclature rules [51] and available software to translate IUPAC names to chemical structures [52] and vice versa [53].

Selection of Relevant Biomarkers
Only after dedicated manual annotations of clinical biomarkers, missing data in pathway models could be evaluated. For this study, seven out of 83 markers could not be found in any of the consulted pathway data. Due to the scattered nature of metabolic reaction mechanism knowledge in pathway databases, network approaches were difficult to implement in an automated fashion. By annotating the metabolic interactions in the PWMs with Rhea [24], information from various databases can be connected to the reaction level rather than the compound level in the future. This annotation also provides stoichiometry and charge-balanced reactions. However, the participating metabolite protonation states were based on a calculation of the charge close to physiological pH, leading to different IDs for the molecules within the PMWs as compared to the clinical assay data. The Rhea database provides an extension to the ChEBI ontology to overcome this issue, by adding the "has_major_microspecies_at_pH_7_3" connection. Another approach to overcome the mismatching of biomarkers could be by converting the clinical biomarker IDs to their corresponding neutral molecular structure InChIKey [54] ID or substructure matching [55].

Theoretical Biomarker Data
Connecting all disease IDs to their counterpart protein ID and data from the IEMbase [36] could only be performed manually. Data could be downloaded as CSV or PDF data, where only the latter includes information on the biomarkers and their positive or negative change status. The IDs of the biomarkers were a mixture of HMDB 3.0 and 4.0, which led to mapping issues. We also found that some syndromes (Lesch-Nyhan versus Kelley-Seegmiller, also known as HPRT-related gout hyperuricemia) were treated as individual disorders by one database (OMIM), while the other combines the information on both disorders in one entry (IEMBase). For another two disorders, pyrimidine 5'-nucleotidase superactivity versus pyrimidine 5'-nucleotidase I deficiency, IEMbase has two separate entries (with different biomarkers), while OMIM only includes pyrimidine 5'-nucleotidase I deficiency. To facilitate data integration and comparison on an automated basis, we advise to provide programmatic access to the aforementioned databases, for example through an API [ref] or SPARQL endpoint [ref]. We also want to encourage harmonizing the information in databases, for example by using existing ontologies such as the Human Disease Ontology [56], human phenotype ontology [57], or nosology on Inherited Metabolic Disorders [58].

Relevant Biomarker overlap
Several IMDs could not be connected to one or more metabolic biomarkers measured in urine. Furthermore, the log2(change) data of patients was converted to the -3 to +3 scale from IEMBase, where the contribution of highly altered biomarkers to the correlation might get obscured.

Pathway Selection
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted February 4, 2022. ; Selecting the top three pathways covering most unique biomarkers could leave out markers very distinctive for a disorder, or not take into account the amount of change; biomarkers with a high change might be more descriptive of a disorder than markers showing marginal changes. Several pathways overlap in terms of content, which could conceal potentially relevant pathways. The query to select the top three pathways covering most unique biomarkers did not take into account which proteins were catalyzing the reactions. The workflow leaves room for manual selection of potentially relevant pathways by experts, which could be aided by reviewing the patient-specific heatmap (step 6). Finally, some pathways were relatively large in terms of node size and captured reactions, leading to higher coverage of biomarkers but also to a larger distance between the individual markers, which diminishes visualizing a clear biological cause and effect path between the affected protein and the relevant biomarkers.

Data Visualization
Connecting data to pathway and network models often require a form of identifier mapping. Unifying the IDs to one database per biological type was a crucial step to avoid mismatches or one-to-many mapping issues. When a biomarker was present at the outer border of a pathway and depicted solely as a reactant, data interpretation might be hindered since this leads to dispersed data visualization. Furthermore, since not all clinical data could be visualized directly in one PWM (due to the biomarkers being spread over multiple models), other approaches will be needed to overcome the boundaries imposed by individual models (as described in step 4.3). Even though the WikiPathways RDF contains pathway data from Reactome pathways, we decided to not use these pathways in our workflow. The model conversion from the native Reactome pathway modeling language SBGN to the language behind PathVisio, GPML, leads to quite some biomarkers captured as unconnected nodes. Other possibilities to visualize pathway data in a network tool such as Cytoscape, include the Reactome Cytoscape Plugin [59] which supports automation, however has been designed for cancer and other types of diseases network analysis focussing on gene and protein content. Other pathway data was made available in Cytoscape for KEGG [60], which can be automated. Finding relevant pathways however was not possible due to a lack of an API or SPARQL endpoint. Two other pathway apps, CyPath2 [49] and cy3sabiork [61] could not be automated.

Data Interpretation
Currently, diagnostic laboratories for inborn errors of metabolism report exact metabolite concentrations in diagnostic patient reports. Pathway and network models on the contrary, report log-or z-scores and pathway names with probabilities. We believe that laboratory specialists first have to learn the interpretation of these new diagnostic tools. In this study the network model helped to easily diagnose 9 out of 16 patient samples, and pointed into the correct direction or suggested follow up analysis for 4 samples. As mentioned in the results section, patient H (diagnosed with hyperornithinemia-hyperammonemia-homocitrullinuria (HHH) syndrome) was found to have more raised biomarkers than noted in IEMbase. IEMbase mentions increased excretion of homocitrulline and orotic acid while the network visualization showed increased excretion of several amino acids in the urea cycle including argininosuccinate. The latter usually is not present in urine and as such is a biomarker for ASL deficiency. However, this patient received citrulline treatment which can lead to increased presence of argininosuccinate. This example shows that laboratory experts are needed to interpret network and pathway models and combine their knowledge on patient treatment and symptoms. For the validation of a recently implemented diagnostic tool called targeted urine metabolomics (TUM) used many of the samples discussed in this current study [43]. When comparing the data interpretation from TUM and the network and pathway workflow here, we can deduce that similar interpretations were reached. This agreement was also found for the Ornithine Transcarbamylase deficiency (OTC) cases, which remains difficult to diagnose especially in women since the disorder is X-linked causing an atypical biomarker pattern. Patient O is an example of such, where we hypothesize that the cyclic metabolic conversion of arginine, argininosuccinate and ornithine into one another could be the cause of this unexpected pattern. To understand these atypical cases of OTC and corresponding biomarker patterns, data from more patients is required. This disorder can be further investigated by measuring OTC activity in blood circulation [62], however this approach also did not seem feasible for asymptomatic heterozygotes. Other approaches entail genetic screening [63] (which can miss de novo cases) or a protein loading test, however, when this provocative test fails a detailed family history and additional molecular testing is advised [64]. For future studies, the interest should shift to measuring metabolite fluxes [2] over a longer timespan to better understand how protein intake triggers disorders [65] and if fasting could be used to treat patients with hypoglycemic symptoms [66].

Conclusions
With this study, we show the potential of a Systems Biology approach combining semantic web technologies for data linking and network analysis for data visualization, to directly connect biological pathway knowledge to clinical cases and biomarker data. Information on treatment and clinical condition remains important for accurate diagnosis, as well as expert interpretation of all information combined into this workflow. The presented workflow is adaptable to analyze different types of IMDs, difficult patient cases and functional assays in the future, which opens up the possibility for usage in the diagnostic workflow.