Identifying robust biomarkers of infection through an omics-based meta-analysis

A fundamental problem for disease treatment is that while antibiotics are a powerful counter to bacteria, they are ineffective against viruses. To ensure a given individual receives optimal treatment given their disease state and to reduce over-prescription of antibiotics leading to antimicrobial resistance, the host response can be measured to distinguish between the two states. To establish a predictive biomarker panel of disease state we conducted a meta-analysis of human blood infection studies using Machine Learning (ML). We focused on publicly available gene expression data from two widely used platforms, Affymetrix and Illumina microarrays, and integrated over 2000 samples for each platform to develop optimal gene panels. On average our models predicted 80% of bacterial and 85% of viral samples correctly by class of infection type. For our best performing model, identified with an evolutionary algorithm, 93% of bacterial and 89% of viral samples were classified correctly. To enable comparison between the two differing microarray platforms, we reverse engineered the underlying molecular regulatory network and overlaid the identified models. This revealed that although the exact gene-level overlap between models generated from the two technologies was relatively low, both models contained genes in the same areas of the network, indicating that the same functional changes in host biology were being detected, providing further confidence in the robustness of our models. Specifically, this convergence was on pathways including the Type I Interferon Signalling Pathway, Chemotaxis, Apoptotic Processes, and Inflammatory / Innate Response. Amongst and related to these pathways we found three genes, IFI27, LY6E, and CD177, particularly prevalent throughout our analysis.

Author summary
Bacterial and viral diseases require specific treatments, and whilst there are various treatment options for specific infection types, rapid diagnosis and identification of the optimal treatment remains challenging.
Even in wealthier countries with developed healthcare

Introduction
The varying differences within both classes of bacterial and viral infections cause the body to respond in a distinct way (1). Bacteria can be countered by pathways such as complement-mediated lysis, and the cell-mediated response for those that survive phagocytosis and live within the cell (intracellular bacteria). In this response, cells present bacterial peptides (antigens) on their surface, which are identifiable by Helper T cells that mediate bacterial destruction (2). There is a large variety of viruses and bacteria that affect the host's immune system in various ways. Whilst some response pathways may overlap for bacterial and viral infections, there are a number of key differences (3,4). In fact, these different response pathways cause varied transcription (expression) of key genes and are the medium for distinguishing disease state based on the host's transcriptional response (5).

large amounts of irrelevant features, often referred to as the curse of dimensionality (26). This high dimensionality is especially pronounced in the case of gene expression data, with the total human gene set being ~20,000.

Various feature selection procedures exist and have been demonstrated in biological problems (24). For this study we focused on Backwards Elimination (BW) (27), forming a well-established benchmark, and an evolutionary algorithm, a more explorative and parameterizable search approach, to obtain reduced model feature sets (17). BW essentially searches for the optimal feature set by progressively eliminating the least important features from a given dataset and testing whether the new model is significantly more accurate than the previous one. Evolutionary algorithms, by contrast, are based on evolving population(s) of models, which are repeatedly intermixed and subjected to random point mutations.
This evolutionary process is assumed to produce converging model populations in terms of performance and their associated feature sets (28).

The application of different computational pipelines often leads to different outcomes in disease prediction (29). We therefore believe it is important not only to present performance statistics for one given model generated by an ML pipeline, but also to explore the underlying biological response of a set of plausible models. By doing so, it is possible to develop a more robust biomarker panel (mitigating overfitting, which would generally produce models that are hard to interpret biologically), and to understand why a given model, or set of similar models, is valid.

In this work, we have performed a meta-analysis over publicly available transcriptomics data (human blood samples where individuals had bacterial, viral or no infection) from two microarray technologies (Affymetrix and Illumina). We applied feature selection and machine learning for biomarker discovery and predictive model generation, and lastly we explored the biological context of the resulting models by reverse engineering the underlying networks. Representing omics data as a network has several key benefits. One can often better represent many complex systems as connected components, and the genome is no exception (30). Clustering is one popular method to explore these complex networks, and many algorithms exist to reveal insight into these complex structures (31). Visualising a clustered network allows us to explore aspects of this generative process, and how feature selection unfolds over it. However, network construction can often be sensitive to the computational approach and parameterization applied (32, 33).
In our approach, we validated our findings and mitigated any potential bias in network generation and clustering by illustrating that the biologically driven feature selection is consistent across two separate networks, containing different studies, and derived from different technological platforms.

samples with lower levels of study confidence (?s) is merged by common genes and pre-processed by

Materials and Methods
Step 1.
Step 1 outputs a combined and batch corrected dataset (B), where only b/v/c samples are present. Two instances of (B) are formed, one where samples of lower levels of support are integrated into b/v classes, and the other completely omitting uncertain samples. Feature selection is performed on data B in Step 2 using (i) Backwards Elimination, and (ii) an Evolutionary algorithm.
Step 2's output is a number of Gene Lists (C) obtained in the feature selection. Data B is also used to infer and cluster a gene interaction network, by (i) reverse engineering the gene interaction network, and (ii) clustering the adjacency matrix. (D) is then formed as the clustered interaction network overlaid with genes found in the best performing model of each dataset and search procedure.
To identify and validate a panel of biomarkers able to differentiate bacterial and viral infections, we performed a meta-analysis of GEO gene expression data, all from open source microarray human blood infection studies. Our analysis was divided into three major method steps: i) pre-processing, ii) feature selection, and iii) inferring a gene interaction network, to discover and validate our gene lists (Fig 1). Following the major steps, we performed and report the results of a final out of sample test on data not previously used in the training phase, for greater validation.

Pre-processing
Data. Datasets from four technological platforms (two from Affymetrix platforms and two from Illumina platforms), consisting of 3868 samples from 21 different studies, were included in the analysis (Table 1). These datasets were selected from a wider pool identified in an initial scan of online databases, based on a variety of factors including: microarray platform manufacturer (the most prevalent platforms, Affymetrix and Illumina); study set size (larger studies with more predictive power); class pathogen strain distribution (aiming for an equal distribution across the data); and ability to merge with other datasets in our analysis.

Table 1. Summary of platform level Affymetrix and Illumina datasets prior to pre-processing.

Deduplicating genes by probes and merging datasets. Dataset columns (originally microarray ProbeIDs) were first deduplicated and substituted by their gene mappings. Where multiple ProbeIDs existed for the same gene we selected the representative ProbeID with the highest average intensity across samples (34). Samples from datasets of the same manufacturer were then merged by common genes, first at the level of studies within the same platform, then by platforms within the same manufacturer.

Batch correction and evaluation. Batch corrections targeted two non-biological sources of systematic variation: (i) inter-platform batch effects (differences between platforms), and (ii) intra-platform batch effects (differences between studies within a platform). Batch correction was implemented with 'ComBat' (16) in a two-step sequential batch correction pipeline (S1 Appendix). We repeated this process for the Affymetrix and Illumina datasets separately to form batch corrected Affymetrix data and batch corrected Illumina data.

Batch correction was verified to retain biological variation and remove technical variation using two validation steps (Fig 1 Step 1). Firstly, we tested whether the significant features before and after batch correction overlapped significantly. Secondly, we performed Principal Component Analysis (PCA) (35), visualising the data in two dimensions and comparing the PCA plots before and after batch correction. For a successful batch correction, the sample clustering seen in the pre-correction PCA plot would be visually removed in the PCA plot of the batch corrected data.

Dealing with study sample ambiguity: forming a confirmed and integrated dataset instance.
To include more data, including some class ambiguity in the original studies, we formed a modelling dataset which integrated bacterial and viral samples with lower levels of confidence (b? and v?) (Table 1). This integrated dataset contained only classes labelled b/v/c (Fig 1). For Affymetrix this formed Affy_I and, similarly, for Illumina this formed Illumina_I. Two additional datasets of confirmed sample classes only were also generated and included in the study, but are presented only in the Appendix.

Dataset Preparation. To deal with the uneven class distributions present in our data (Table 1) we employed two strategies. Firstly, we used a study-aware data split which ensured relatively equal class proportions across the training, test and evaluation data splits. Secondly, we ensured that classification accuracy bias due to the larger class proportion of some disease states was minimized by weighting smaller classes correspondingly higher (18). This ensures that our model will not be biased towards the classes with a larger proportion of samples in the dataset.

Backward Elimination. We operated on a 60/20/20 training/test/evaluation data split for each dataset processed in BW (37). On each training set we ran 240 BW search procedures, using Out-of-bag (OOB) error as the minimisation criterion, implemented using the varSelRF R package (38). Each run generated a single optimal model which minimised OOB error. For each dataset a single representative model, which maximised accuracy on the test data, was selected from the 240 runs.

Genetic-algorithm. The Genetic-Algorithm (GA) optimized approach is an efficient method for creating suitable multivariate models. We used the R library GALGO (17) to identify a small feature model by continuously crossing a number of small feature models (chromosomes of features) with each other, hypothetically identifying better models with successive generations.
We used an initialised fitness goal of 0.95, a model size (chromosome size) of 15 genes, and k-fold cross-validation to counter overtraining. In the RF, larger classes, namely viral, were also penalized so as to ensure equal predictions across classes. After 250 models, we generated a representative model through a frequency based forward selection strategy which ensures only genes that contributed to predictions are included in the final model (S2 Appendix).
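The evolutionary search described above can be illustrated with a minimal sketch. This is not GALGO and not the pipeline used in the study: the function `evolve_gene_panel`, the placeholder gene names and the `toy_fitness` function are hypothetical stand-ins, and in the study fitness was the cross-validated performance of a Random Forest. Only the chromosome size of 15 and the fitness goal of 0.95 are taken from the text.

```python
import random

def evolve_gene_panel(genes, fitness, size=15, pop_size=30, generations=50,
                      goal=0.95, mutation_rate=0.05, seed=0):
    """Toy genetic algorithm that evolves fixed-size gene sets (chromosomes)."""
    rng = random.Random(seed)
    population = [rng.sample(genes, size) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        if fitness(best) >= goal:          # stop once the fitness goal is met
            break
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # keep the fitter half as parents
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, size)   # single-point crossover
            head = a[:cut]
            child = head + [g for g in b if g not in head][: size - cut]
            for i in range(size):          # random point mutations
                if rng.random() < mutation_rate:
                    child[i] = rng.choice([g for g in genes if g not in child])
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)
    return best

# Hypothetical example: 200 placeholder genes, 15 of which are "informative".
GENES = [f"G{i}" for i in range(200)]
INFORMATIVE = set(GENES[:15])

def toy_fitness(chromosome):
    # Stand-in for cross-validated classifier accuracy (assumption).
    return sum(g in INFORMATIVE for g in chromosome) / len(chromosome)

panel = evolve_gene_panel(GENES, toy_fitness)
print(sorted(panel), toy_fitness(panel))
```

Keeping the best chromosome seen so far means that the returned panel is never worse than the best member of the initial random population, which is the basic property that repeated crossing and mutation are intended to exploit.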

Inferring underlying interaction network
We reverse engineered gene regulatory networks using ARACNe (39), which builds an adjacency matrix of genes with their mutual information from expression data (Fig 1). These networks allow identification of functional relationships between genes and their corresponding products (40, 41). In addition, they can provide insight into the functionally relevant groups of genes for distinguishing disease state, by examining the locations of RF selected genes.

To select significant interactions within our dataset we used a p-value threshold < 0.05 in the ARACNe procedure. The approach can then estimate a mutual information threshold that is relevant for the provided dataset and the specified p-value. With our data this resulted in a threshold of MI > 0.0176 for an interaction to be retained. From the gene pairs and their mutual information, we formed an edge table which was the basis for our interaction network. Nodes are genes and edge weights are the mutual information between two genes, where greater mutual information suggests a stronger relationship. We then loaded our networks into Cytoscape (42), which visualises molecular interaction networks and has support for a number of clustering algorithms.

To identify highly interconnected sub-networks within our reconstructed regulatory network we utilised the Cytoscape clustering plugin GLay (32). GLay uses an implementation of the Girvan-Newman edge-betweenness algorithm (43), which we used to split our networks into clusters of connected genes. This resulted in a number of smaller sub-networks and allowed us to inspect their functional roles within the larger network. We then mapped higher level ontologies, such as pathways and gene ontology, from gene symbols and used the DAVID (44) tool to provide enrichment analysis.
The enrichment analysis looked at several different ontologies, providing an indication of overrepresentation, which we used to infer the likely biological function of a given cluster. Each cluster analysis generated an enrichment table detailing enriched ontology terms along with enrichment ratios and (adjusted) p-values. From the enrichment table we then produced a dotplot which depicted enrichment ratio, p-value and gene count, along with a colour scheme denoting different ontologies, for visual interpretation.

For clusters of genes with enriched and significant terms related to the immune response, we labelled them manually as Functionally Relevant (FR) clusters. These FR clusters allowed us to make inferences about which biological functions hold predictive power, by overlaying model selected genes onto our labelled gene regulatory network.
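The construction of a mutual information edge table can be sketched as follows. This is a simplified illustration rather than ARACNe itself: it assumes rank-based binning into three levels and a plug-in MI estimate in nats, and it omits ARACNe's own MI estimator, its p-value based threshold estimation and its pruning of indirect interactions. The function names and the toy expression values are hypothetical; only the threshold value of 0.0176 is taken from the text.

```python
import math
from collections import Counter
from itertools import combinations

def discretise(values, bins=3):
    """Equal-frequency (rank-based) binning of one gene's expression values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * bins // len(values)
    return labels

def mutual_information(x, y):
    """Plug-in estimate of mutual information (in nats) of two label vectors."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def build_edge_table(expression, mi_threshold):
    """Return (gene1, gene2, MI) rows for gene pairs with MI above the threshold."""
    binned = {gene: discretise(vals) for gene, vals in expression.items()}
    edges = []
    for g1, g2 in combinations(sorted(binned), 2):
        mi = mutual_information(binned[g1], binned[g2])
        if mi > mi_threshold:
            edges.append((g1, g2, mi))
    return edges

# Hypothetical expression values for three placeholder genes over nine samples.
expression = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "B": [2, 4, 6, 8, 10, 12, 14, 16, 18],  # rises with A
    "C": [1, 4, 7, 2, 5, 8, 3, 6, 9],       # binned values independent of A
}
edges = build_edge_table(expression, mi_threshold=0.0176)
print(edges)
```

Each row of the resulting table corresponds to one weighted edge, which is the form in which an interaction network can be loaded into a tool such as Cytoscape for clustering.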

Out of sample testing
Out of sample testing usually refers to testing a model on data not previously seen in model training and selection (37). Whilst a validation set was held back for both Affymetrix and Illumina data, the validation data still contained samples from the same manufacturer and group of studies used in training. Hence, within the original 'discovery dataset', gene lists could still be overfitted to some non-biological effect persisting in either the manufacturer technology or the set of studies present, which was not removed by batch correction.

To properly test generalisability and investigate any discovery data bias, we evaluated the best performing models discovered on both Affymetrix and Illumina data by retraining and testing them on non-discovery data (Affymetrix gene lists on Illumina data, and Illumina gene lists on Affymetrix data). These non-discovery datasets contained samples from different studies and a different technology, and therefore represented the ideal validation datasets. With similar error between discovery and non-discovery data, one can be confident that models have not overfitted to a given dataset, which suggests that they are generalisable.

Pre-processing
Gene de-duplication and data merging were successful for both Affymetrix and Illumina. In the final Illumina datasets 19,947 distinct genes were found intersecting all studies, whereas for the Affymetrix data we found 13,383 (Table 2). This lower Affymetrix count was due to platforms GPL571 and GPL9188 having only 13,383 distinct genes (Table 1).

Affymetrix platforms were successfully merged and combined via our batch correction pipeline, indicated by non-significant changes in differentially expressed (DE) genes and removal of clustering in our PCA analysis for both the study and platform batch corrections (S1 Appendix). Illumina based datasets were represented by a single platform, GPL10558. Batch correction did not result in significant changes to DE genes and removed the previously observed clustering by study (S Fig 2).

The resulting two datasets, Affy_I and Illumina_I, contained 1676 and 1892 samples respectively. It is evident that there is an uneven class distribution present in both datasets. Both Affy_I and Illumina_I are made up of more than 50% viral samples (66.89% and 56.50%, Table 2). The most underrepresented class is bacterial samples, with both datasets comprising fewer than 20% samples labelled as bacterial (Table 2).

Biomarker lists
Running GA and BW on both Affymetrix and Illumina generated an ensemble of models for each method-dataset pair. For BW this was an ensemble of optimal models, one per run of the algorithm. For GA this was the evolved chromosomes obtained by repeats of the search procedure. From this ensemble of models, we computed relative gene selection frequencies (top 16 genes displayed in Table 3).
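As a minimal sketch of how a relative gene selection frequency can be computed from an ensemble of models, the following counts the fraction of runs in which each gene was selected. The function name and the four example runs are hypothetical and purely illustrative; they are not results from the study.

```python
from collections import Counter

def relative_selection_frequency(models):
    """Fraction of models (runs) in which each gene was selected.

    `models` is a list of gene lists, one per run of a search procedure.
    Returns (gene, frequency) pairs sorted by decreasing frequency, then name.
    """
    counts = Counter(gene for model in models for gene in set(model))
    n_models = len(models)
    return sorted(((gene, count / n_models) for gene, count in counts.items()),
                  key=lambda pair: (-pair[1], pair[0]))

# Hypothetical ensemble of four runs (made-up gene lists for illustration only).
runs = [
    ["LY6E", "IFI27", "CD177"],
    ["LY6E", "IFI27"],
    ["LY6E", "IFI44"],
    ["LY6E", "CD177"],
]
frequencies = relative_selection_frequency(runs)
print(frequencies)
```

A gene with a frequency of 1.0 was selected in every run, which is the sense in which a search procedure is said to converge on that gene.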

BW search procedures in both technologies converged to a small set of genes, indicated by a high relative selection rate, calculated from the number of times a gene was selected across the multiple runs performed in each optimisation procedure. For Affymetrix, 14 genes were included at a rate of 1.0, whereas for Illumina the BW results contained 12 genes at a rate of 1.0 (Table 3). GAs, on the other hand, showed a much wider gene selection in the evolved chromosomes: in both search procedures only a single gene was included at a relative rate of 1.0, which reflects the more varied selection in GA search procedures.

Overall search results (aggregated between runs by frequency) from BW and GA in both Affymetrix and Illumina all contained LY6E (Lymphocyte antigen 6E, UniProt: Q16553) amongst their 9 most frequently selected genes (Table 3). Amongst the next most widely selected genes were IFI27 (Interferon alpha-inducible protein 27, mitochondrial, UniProt: P40305) and IFI44 (Interferon-induced protein 44, UniProt: Q8TCB0), both in the top 16 by gene selection frequency for three of the four search procedures (Table 3).

To further investigate gene convergence, we compared the relative model gene inclusion rates for all search procedures together. We scaled each model gene frequency (between 1 and 25), then plotted them together as a stacked bar plot. Fig 2 shows [...] being represented in all search procedures. However, interestingly, IFI27 is also included amongst all search procedures. Furthermore, CD177, a neutrophil-specific receptor known to show increased expression in patients in septic shock (52, 53), was selected relatively frequently and was present in all search procedures.

One interesting aspect is the intersection of genes frequently selected in both Affymetrix and Illumina generated models.
We identified 88 genes intersecting between Affymetrix and Illumina (S1 Table) and performed functional enrichment analysis of them using DAVID. We found both highly enriched and significant terms relating to the immune response. Included in the list of significant pathways were, in order of significance, 'Antiviral defense' comprising 12 genes, the 'type I interferon signalling pathway', which included 10 genes, and 'Immunity', encompassing 17 of the 88 genes intersecting between Affymetrix and Illumina search procedures (Fig 3).

For each search procedure we obtained a final representative model (Affy_BW, Affy_GA, Illumina_BW, and Illumina_GA) and evaluated its performance on a held-out data split. Model performance was recorded as the size of the gene list and its class-based performance in terms of: Balanced Accuracy, Sensitivity, Specificity, and McNemar's test p-value, which tests for consistency in responses and can reveal bias towards classifying a certain class (all metrics derived from the evaluation data split) (54).

[...] (Table 4).
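The class-based metrics named above can be sketched for a single class treated one-vs-rest. This is a minimal illustration, not the code used in the study: the function name and the example labels are hypothetical, and the McNemar p-value is computed with the continuity-corrected chi-square statistic on the two kinds of misclassification (false positives and false negatives), which is an assumption about how the test was applied.

```python
import math

def one_vs_rest_metrics(y_true, y_pred, positive):
    """Sensitivity, specificity, balanced accuracy and a McNemar p-value
    for one class against all others. Assumes the positive class and at
    least one other class are both present in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    if fp + fn == 0:
        p_value = 1.0                     # no misclassifications to compare
    else:
        statistic = (abs(fp - fn) - 1) ** 2 / (fp + fn)
        # Survival function of a chi-square distribution with 1 degree of freedom.
        p_value = math.erfc(math.sqrt(statistic / 2))
    return {"sensitivity": sensitivity, "specificity": specificity,
            "balanced_accuracy": balanced_accuracy, "mcnemar_p": p_value}

# Hypothetical labels: b = bacterial, v = viral, c = control.
y_true = ["b", "b", "b", "b", "v", "v", "v", "v", "c", "c"]
y_pred = ["b", "b", "b", "v", "v", "v", "v", "v", "c", "b"]
bacterial = one_vs_rest_metrics(y_true, y_pred, positive="b")
print(bacterial)
```

Balanced accuracy averages sensitivity and specificity, so it is not inflated by a large majority class, which matters here given the predominance of viral samples in both datasets.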

Inferred interaction networks
We inferred the underlying gene regulatory networks for both Affymetrix and Illumina datasets, but present here the analysis on the larger Illumina network (Affymetrix analysis in S3 Appendix). GLay clustering of the gene interaction network initially revealed 14 clusters containing more than 10 genes (Fig 4). To enable a more granular analysis of specific network sections (those indicated by enrichment analysis to be functionally relevant in the immune response (FR), or containing genes selected by our models) we further partitioned several of the initial clusters, forming a network hierarchy (limited to a depth of 3). This resulted in 110 distinct groups of genes which we analysed (Table 5).

In Illumina, 24 of the 110 clusters had enriched and significant terms related to functions of the immune system in our DAVID analysis (Table 5). Of these 24 FR clusters, 10 had been selected by at least one Illumina model. These 10 clusters contained 55 genes in the union of Illumina models (68% of all 81 Illumina selected genes in the network). Additionally, a small number of clusters (four) were selected by every model.

[...] HERC5, IFI44, IFI44L, IFI6, IFIT1, IFIT2, IFIT5, ISG15, MX1, OAS2, OAS3, RSAD2, RTP4, SAMD9, SPATS2L, and TMEM123).

Cross manufacturer gene list performance
We evaluated each of the BW and GA representative models from Affymetrix on the Illumina data, and the Illumina models on the Affymetrix data. Contrasting each model's performance between these discovery and non-discovery datasets gives the performance results depicted in Fig 5. This figure shows the difference in overall accuracy, and in class-based accuracy, specificity and sensitivity, when generalising our models to data from a different technology and set of studies.

In terms of overall accuracy (Fig 5 A), Affymetrix models, both GA and BW, performed worse when applied to the Illumina data. However, the drop was less than 0.1 for both Affymetrix GA and BW, whereas both Illumina GA and BW models slightly gained accuracy when applied to the Affymetrix data (0.04 and 0.05 respectively).

[...] based performance in terms of Balanced Accuracy, Sensitivity, and Specificity. For each performance measure, bars are grouped by model, and each bar refers to the difference between performance on the original dataset (which each model was discovered on) and the performance on the data it had not been exposed to. For Affymetrix models this would contrast the performance on the Affymetrix data with the same model's performance on the Illumina data.

Looking specifically at bacterial performance (Fig 5 B), both Illumina models performed worse on the Affymetrix data in terms of bacterial Balanced Accuracy (BW_I 0.71 and GA_I 0.73 2dp), whereas the Affymetrix models performed well on the Illumina data (BW_I 0.89 and GA_I 0.89 2dp). In terms of bacterial specificity there was little change for all models, staying within +/-0.05 2dp of change in performance. However, in terms of bacterial sensitivity, the Illumina models performed notably worse on the Affymetrix data (BW_I 0.44 and GA_I 0.47 2dp).
Across viral class specific metrics (Fig 5 B), no model had any large change in Balanced Accuracy (change < 0.05 2dp). The largest metric change was seen in sensitivity, with Affymetrix models slightly decreasing, but with an original score of 0.97 and 0.95 for BW_I and GA_I they still perform well when run on the Illumina data.

Overall, both Affymetrix and Illumina models performed well given that the data came from different manufacturers and different groups of studies. In particular, the stability of viral performance suggests a robustness of the gene lists for classifying viral samples correctly. However, given that the change in bacterial performance was very comparable to viral, it too suggests a strong ability to classify bacterial samples, even when moving out of the original dataset.

Due to the amount of relevant data, we focused our analysis on studies from two of the largest microarray platforms, Affymetrix and Illumina. Whilst these both determine the expression levels of genes and are common in large-scale population studies, differences in quantification and normalisation of gene expression values create technical differences (58). Studies within manufacturers were successfully batch corrected, indicated by non-significant changes in differentially expressed genes and removal of sample clustering by studies and platforms in PCA analysis. However, the combination of studies between manufacturers was unsuccessful, leading to two parallel analyses on the combined and batch corrected versions of (i) Affymetrix and (ii) Illumina datasets, which minimized biological variation loss.

Simpler solutions are more specifically justifiable and allow for greater interpretation, which is the motivation for feature selection amongst models in biological data. We employed two feature selection algorithms using the Random Forest Classifier over our data: Backwards Elimination and GALGO, both essentially cutting the noise and finding the most significant biological variation responsible for predicting disease state. It is unknown without a brute force search whether a truly optimal combination of genes has been found; however, both BW and GA approaches converged around a small group of genes located in uncorrelated and functionally separable clusters. Models were found to be strongly enriched for the ISGs. In fact, IFI27 and LY6E (both ISGs) were included in all Affymetrix and Illumina models. IFI27 is involved in various signalling pathways affecting apoptosis (59-61). LY6E, by contrast, belongs to a class of interferon-inducible factors that broadly enhance viral infectivity (62).
LY6E has also been attributed a diverse set of effects, including attenuating T-cell receptor signalling (63) and suppressing responsiveness to Lipopolysaccharide, which stimulates immune responses (64). Moreover, IFI27 was shown by Tang et al. to be a single-gene biomarker that discriminates between influenza and other viral and bacterial infections in patients with suspected respiratory infection (65). However, this single-gene biomarker approach lacks generalisability and robustness when predicting a more varied pathogen set. As we have observed, performance in our meta-analysis was greatly improved by including more genes in our models.

By inferring the underlying interaction network, we discovered that convergence was not only happening to a set of genes, but also, and more prominently, around particular groups of functionally similar genes. This gene-group convergence only emerged as part of an in-depth investigation into the driving forces of feature selection from a biological network perspective. When representative members of these uncorrelated gene clusters are taken together, they can form highly predictive gene lists. With the ability to define the host response to viral and bacterial infections, genes of our identified clusters are likely good at approximating key functions important in disease state prediction. Notably, the four functional groups of genes were indicated to be: Type I interferon-inducible genes (ISGs), Chemotaxis genes, Apoptotic Processes genes, and Inflammatory / Innate Response genes, which were prevalent in every model (both Affymetrix and Illumina). Within this cluster convergence we found a highly selected group of genes to be ISGs (the most frequent between both Affymetrix and Illumina models).
This is no surprise, given that Type I Interferons serve as a link between the innate and adaptive immune systems (67) and have a broad range of effects on both innate and adaptive immune cells during infection with viruses, bacteria, and parasites (47). Their varying sensitivity to particular forms of pathogens is likely why a number can be used in conjunction for classification with RFs. While the exact functions of ISGs are not fully understood, it appears our RF models have identified their strong connection to disease state (68, 69). Whilst convergence was prominent around four functional groups of genes, we also note that in both Affymetrix and Illumina a larger, more variable set of functional gene groups was additionally used within our gene lists. Hence, there is a degree of variability in gene solutions, and it seems there is an interchangeable portion of our gene lists in which a number of genes from uncorrelated functional groups can be used to achieve high performance in defining disease state.

Finally, we verified our gene lists for generalisability by retraining and evaluating on data from a different manufacturer to that on which they were discovered (Affymetrix gene lists on Illumina data and Illumina gene lists on Affymetrix data). It is apparent that all gene lists tend to do better on Affymetrix data, regardless of which set they were discovered on, which suggests that the dataset, not the gene lists, is influencing performance. Hence, we have uncovered the underlying differentiating biological signatures able to define bacterial and viral infections.

Our meta-analysis of Affymetrix and Illumina human blood infection data has revealed several panels of genes which are able to distinguish well between bacterial and viral infections. The difference in technology and gene coverage between Affymetrix and Illumina did not allow for a direct integration in our analysis. However, we were able to confirm that convergence was occurring independent of the technology, to both the same genes and the same functional groups of genes. This technology independent differentiable signal is learnable, and we demonstrated its presence by reconstructing the underlying regulatory gene network and overlaying models from the two datasets.

Acknowledgments
We thank all the contributing studies for generating and making publicly available their respective datasets. We also gratefully acknowledge DSTL (www.gov.uk/dstl) for providing support.

This work was also supported by the Chem-Bio Diagnostics program contract HDTRA1-12-