Early prediction of liver disease using conventional risk factors and gut microbiome-augmented gradient boosting

Gut microbiome sequencing has shown promise as a predictive biomarker for a wide range of diseases, including classification of liver disease and severity grading. However, the potential of gut microbiota for prospective risk prediction of liver disease has not been assessed. Here, we utilise shallow gut metagenomic sequencing data of a large population-based cohort (N=>7,115) and ~15 years of electronic health register follow-up together with machine-learning to investigate the predictive capacity of gut microbial predictors, individually and in conjunction with conventional risk factors, for incident liver disease and alcoholic liver disease. Separately, conventional and microbiome risk factors showed comparable predictive capacity for incident liver disease. However, microbiome augmentation of conventional risk factor models using gradient boosted classifiers significantly improved performance, with average AUROCs of 0.834 for incident liver disease and 0.956 for alcoholic liver disease (AUPRCs of 0.185 and 0.304, respectively). Disease-free survival analysis showed significantly improved stratification using microbiome-augmented risk models as compared to conventional risk factors alone. Investigation of predictive microbial signatures revealed a wide range of bacterial taxa, including those previously associated with hepatic function and disease. This study supports the potential clinical validity of gut metagenomic sequencing to complement conventional risk factors for risk prediction of liver diseases.


Liver disease causes ~2 million deaths per year worldwide, approximately 3.5% of all deaths, and is increasingly common in aging populations [1,2]. The aetiology of liver disease is complex and includes several inter-related risk factors, such as obesity, age and excess alcohol consumption [3]. Alcohol consumption, in particular, is a major contributor to liver disease, accounting for >50% of cirrhosis deaths [2]. The consequences of liver disease can be acute or chronic with highly variable progression rates; however, most patients are not diagnosed until an advanced stage when liver function is overwhelmed (e.g. decompensated cirrhosis) [4,5]. Currently, liver biopsy remains the gold standard for diagnosis and classification of disease stage, but biopsy is invasive and its use is therefore restricted. Although non-invasive tests for detecting liver disease are available, such as ultrasound, computed tomography, magnetic resonance imaging and spectroscopy, they are primarily applicable to the detection of advanced disease [6-8]. Hence, there is an unmet need for high fidelity early detection and risk prediction approaches for liver disease.

The role of the human gut microbiome, the collection of microorganisms residing in the gastrointestinal tract, has been increasingly recognized in various aspects of liver disease [9,10]. Interest in the gut microbiome has rapidly grown as sequencing technologies have progressed from 16S rRNA amplicon sequencing to shotgun metagenomics. Recent studies have revealed evidence linking gut microbial composition and the pathogenesis of liver disease [11-13], as well as potential therapeutic approaches targeting gut microbial communities [14,15]. Importantly, the gut microbiome has shown potential for differentiating cirrhosis from non-cirrhosis controls. Qin et al. showed that gene and function level biomarkers derived from metagenomics could classify liver cirrhosis patients and healthy controls [16]. Loomba et al. successfully distinguished advanced fibrosis from mild and moderate NAFLD using gut microbiome profiles characterized by whole-genome shotgun sequencing with random forest classifiers [17]. Later, Caussy et al. used random forest classifiers to distinguish NAFLD-cirrhosis from non-NAFLD healthy controls based on gut microbial compositions from 16S sequencing [18]. However, previous studies have been limited by cross-sectional study design and there are limited data regarding the longitudinal association between baseline microbiota and incident liver disease. Establishing such an association would be an important step in investigating whether the gut microbiome is causally linked to liver disease or can be used as a stratification tool to identify those at high risk, who may benefit from targeted interventions.

Therefore, we designed a longitudinal study to examine the association between the gut microbiome and incident liver diseases, and its predictive capacity, using shallow metagenomic sequencing and supervised machine learning in a large population-based cohort of >7000 individuals with approximately 15 years of electronic health records (EHR) follow-up. Traditional statistical and machine learning approaches are compared on gut metagenomes, and their predictive capacity is evaluated individually and in combination with conventional risk factors, including age, sex, body mass index, waist-hip ratio, alcohol consumption, smoking status, triglycerides, high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, and gamma-glutamyl transferase levels. The best performing models are further assessed using survival analysis for time to disease onset. Taken together, our study assesses the potential clinical validity of adding the gut metagenome to conventional risk factors for prediction of incident liver disease. We make our predictive models freely available (see Data Availability).

[...] Table 1. To investigate the predictive capacity of baseline gut microbiome and conventional risk factors for incident liver diseases, we matched phenotype metadata with gut microbial profiles derived from stool samples, and linked the baseline data to follow-up diagnoses of any liver diseases (LD) or alcoholic liver disease (ALD) defined by ICD-10 codes (Methods). After stringent quality control and filtering (Methods), 41 cases of incident ALD and 103 cases of incident LD were considered for prediction analyses.

BASELINE GUT MICROBIAL COMPOSITION
Stool samples were sequenced by shallow shotgun metagenomics to a mean depth of approximately 1.056 million reads per sample. After human sequences, low quality and adapter reads were removed, a total of 7.63 billion reads were classified using a GTDB release 89 index database for taxonomic classification, resulting in 967,000 post-QC and classified reads per sample on average. In total, GTDB classification uniquely identified 151 phyla, 338 classes, 925 orders, 2,254 families, 7,906 genera and 24,705 species from gut metagenomes. We focused on common bacterial taxa to reduce alignment artefacts and noise; taxa were filtered by relative abundance (>0.01% in at least 1% of samples), which resulted in 46 phyla, 71 classes, 124 orders, 232 families, 617 genera and 1,224 species for further analysis. Overall, the most abundant taxa were members of phyla Firmicutes, Firmicutes_A (corresponding to Firmicutes in NCBI), Firmicutes_C (Firmicutes), Bacteroidota (Bacteroidetes), Actinobacteriota (Actinobacteria), and Proteobacteria (Supplementary Figure 1).
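The abundance and prevalence filter described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `filter_common_taxa`, the dictionary layout and the choice of an inclusive prevalence threshold are assumptions made for illustration only.

```python
def filter_common_taxa(profiles, min_abundance=1e-4, min_prevalence=0.01):
    """Keep taxa whose relative abundance exceeds `min_abundance`
    (0.01% expressed as a fraction) in at least `min_prevalence`
    (1%) of samples.

    profiles: dict mapping taxon name -> list of relative abundances
              (fractions in [0, 1]), one value per sample.
    Assumption: the prevalence threshold is treated as inclusive.
    """
    kept = {}
    for taxon, abundances in profiles.items():
        n_samples = len(abundances)
        if n_samples == 0:
            continue
        # Count samples in which the taxon passes the abundance threshold.
        n_present = sum(1 for a in abundances if a > min_abundance)
        if n_present / n_samples >= min_prevalence:
            kept[taxon] = abundances
    return kept
```

For example, with 200 samples a taxon above 0.01% in two samples (1%) is retained, whereas one above the threshold in a single sample (0.5%) is dropped.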

The workflow for machine learning to predict incident liver disease is shown in Figure 1. For both ALD and LD, samples were randomly partitioned based on the prediction target into a training set for discovery (70% of samples) and a validation set for evaluation (remaining 30%), and the partitioning itself was randomly performed 10 times to assess sampling variation. Within the training set, we developed and tested prediction models through cross-validation, and the optimal models were assessed for final performance in the withheld validation set (Methods). Prediction models were derived from different taxonomic levels separately, since taxa at higher ranks are inclusive of their members at lower ranks and introducing redundant features can lead to impaired prediction performance. The average results of the 10 sample partitions are reported.

To define a subset of informative taxa, we performed pre-selection of microbial features associated with incident liver disease from the union of three approaches in the training sets (Methods). After pre-selection, there were on average 10, 16, 42, 123, 355 and 508 microbial taxa at phylum, class, order, family, genus and species levels for incident ALD, and 9, 12, 25, 62, 194 and 303 for incident LD, respectively. To incorporate microbial diversity measures, Chao1, Pielou's and Shannon's indices were included as additional features. These microbial features were then used to build prediction models in the corresponding training sets.

Gradient boosting classifiers were applied to pre-selected microbial features to develop and optimize prediction models with cross-validation in the training datasets. For comparison, we also included two robust and common statistical approaches, logistic regression and ridge regression.
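For readers unfamiliar with the three diversity indices named above, the following Python sketch gives their textbook definitions computed from per-taxon read counts. The text does not state which estimator variants or software were used, so the bias-corrected form of Chao1 and the function names here are assumptions for illustration.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou(counts):
    """Pielou's evenness J = H / ln(S), where S is the number of observed taxa."""
    observed = sum(1 for c in counts if c > 0)
    if observed < 2:
        return 0.0  # evenness is not defined for a single taxon; return 0 by convention
    return shannon(counts) / math.log(observed)


def chao1(counts):
    """Bias-corrected Chao1 richness: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton taxa."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))
```

Shannon and Pielou depend only on proportions, whereas Chao1 requires integer read counts because it relies on the numbers of taxa seen exactly once or twice.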

PREDICTION OF INCIDENT LIVER DISEASE
The gradient boosting classifier generally outperformed both multivariable logistic regression and ridge regression, particularly at lower taxonomic levels (Fig. 2). With the gradient boosting classifier, higher prediction performance was observed at lower taxonomic levels for both incident ALD and LD, suggesting that the strength of association for higher resolution of gut microbial features outweighs their lower abundances at these levels. For LD, we obtained the highest prediction performance at species level with an average AUROC of 0.733 (95% CI 0.713-0.752; Fig. 2a). At other taxonomic levels, the mean AUROC for LD ranged from 0.622 at phylum level to 0.725 at genus level. When predicting ALD, we obtained average AUROC > 0.75 at phylum and class levels, and average AUROC > 0.85 at other taxonomic levels, with the highest value of 0.895 (95% CI 0.881-0.909) at species level (Fig. 2b).

Ridge regression tended to perform better than logistic regression (Fig. 2) [...] The logistic regression yielded its highest AUROC of 0.651 (95% CI 0.609-0.694) at family level and AUROC < 0.60 at other taxonomic levels for predicting LD (Fig. 2a); for ALD, the best performance was obtained at order level with an average AUROC of 0.694 (95% CI 0.637-0.751; Fig. 2b). Although logistic regression is highly efficient and interpretable, it did not perform well in this case, where a large number of features are correlated. The L2 regularization of ridge regression handled inter-correlated microbial features better than logistic regression. However, both methods underperformed compared to the gradient boosted decision tree classifier, which is known to better capture nonlinear relationships and is robust to correlated features. The gradient boosted decision tree classifier was used in subsequent analyses.

medRxiv preprint doi: https://doi.org/10.1101/2020.06.24.20138933; this version posted June 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

BENCHMARKING REFERENCE MODELS USING CONVENTIONAL APPROACHES
Conventional risk factors are commonly used for predicting liver disease risk [20,21]. We built reference models using a comprehensive set of conventional risk factors, including sex, age, alcohol consumption, smoking status, body mass index (BMI), waist-hip ratio (WHR), triglycerides, high-density lipoprotein (HDL), low-density lipoprotein (LDL) and gamma-glutamyl transferase (GGT), to compare with the prediction capacity of microbiome-based models (Methods). The conventional prediction model achieved an average AUROC score of 0.768 (95% CI 0.746-0.789) for incident LD, slightly higher than the highest AUROC score of microbiome-only models achieved at species level (AUROC 0.733) (Fig. 2a). For ALD, the average AUROC reached 0.875 (95% CI 0.855-0.896), slightly lower than the AUROC of the gradient boosting model achieved using species-level microbial features alone (AUROC 0.895) (Fig. 2b). Both conventional models and microbiome-based models had substantial predictive power individually; the next section evaluates the combination of conventional risk factors and microbial compositions for LD and ALD prediction.

INTEGRATING GUT MICROBIOME AND CONVENTIONAL RISK FACTORS
To investigate the potential of a microbiome-augmented prediction model for liver disease, we utilised the gradient boosting classifier of microbiome features together with all conventional risk factors related to the disease, and followed the same partitioning for training and testing (Methods). To evaluate the performance comprehensively, the optimal models were assessed for both AUROC and AUPRC. Since greater taxonomic resolution offered better predictive performance, we compare the species-level augmented models with the models based on conventional risk factors only.

Overall, the microbiome-augmented models achieved greater AUROC and AUPRC than the conventional prediction models. Prediction of LD (Fig. 3a) [...] Fig. 3b). With a baseline AUPRC value of 0.015 for LD, the species-level augmented model achieved an average AUPRC of 0.185 (95% CI 0.161-0.21), which was higher than the average AUPRC of 0.158 (95% CI 0.132-0.185) yielded by the conventional prediction model (Fig. 3c). For ALD, with a baseline AUPRC of 0.006, the species-level augmented model and the conventional model achieved average AUPRCs of 0.304 (95% CI 0.261-0.348) and 0.199 (95% CI 0.138-0.260; Fig. 3d), respectively.
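As a quick arithmetic check, the AUPRC baselines quoted above can be reproduced from the case and sample counts reported elsewhere in the text (103 incident LD cases among 6965 samples, 41 incident ALD cases among 7005 samples), since the baseline of a precision-recall curve equals the proportion of cases. The short Python sketch below assumes that stratified partitioning keeps this proportion approximately unchanged in the validation sets; the helper name is hypothetical.

```python
def auprc_baseline(n_cases, n_samples):
    """Expected AUPRC of a random classifier, i.e. the case prevalence."""
    return n_cases / n_samples


# Counts taken from the Results and the inclusion criteria in Methods.
ld_baseline = auprc_baseline(103, 6965)
ald_baseline = auprc_baseline(41, 7005)

print(round(ld_baseline, 3), round(ald_baseline, 3))  # 0.015 0.006
```

Against these baselines, an AUPRC of 0.185 for LD is roughly a twelvefold enrichment over random ranking, and 0.304 for ALD roughly a fiftyfold enrichment.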

RISK MODELS
We next performed survival analysis using time-on-study Cox regression in the validation sets to assess the potential clinical validity of microbiome-augmented (species level) risk models as compared to conventional risk factors only (Methods). The Cox model of conventional risk factors achieved an average C-statistic of 0.813 (95% CI 0.792-0.835) for LD and 0.922 (95% CI 0.903-0.940) for ALD. The microbiome-augmented risk models yielded a higher average C-statistic of 0.838 (95% CI 0.814-0.862) for LD and 0.959 (95% CI 0.950-0.968) for ALD. Consistent with this finding, the microbiome-augmented model fitted significantly better (likelihood ratio test, LRT; p<0.01) than the model using conventional risk factors only. Disease-free survival of those in the highest 5% of microbiome-augmented risk was worse than that of those in the highest 5% of risk based on conventional risk factors alone (Figure 4).
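To make the C-statistic reported above concrete, the following Python sketch computes Harrell's concordance index for right-censored data. It is a generic illustration rather than the authors' code (their analysis was run in R), and the function and argument names are assumptions.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: among comparable pairs, the fraction in which the subject
    with the shorter observed time also has the higher predicted risk.

    times:       follow-up time per subject
    events:      1 if the disease event was observed, 0 if censored
    risk_scores: model output, higher meaning higher predicted risk
    A pair (i, j) is comparable when subject i had an observed event
    strictly before subject j's follow-up time. Tied risks count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of event times, which is the scale on which the values of 0.813 to 0.959 above should be read.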

Desulfovibrionaceae of phylum Desulfobacterota_A (Proteobacteria) exhibit an immense amount of endotoxin activity [54]. A recent study has shown that endotoxin-producers that overgrow in patients with fatty liver, including strain members of Escherichia and Klebsiella, can induce NAFLD in mouse models, suggesting a potential causative role in NAFLD [55]. The altered gut microbiota composition in cirrhosis is partially attributed to reduced primary bile acids and increased secondary bile acids in the gut lumen that result from liver insufficiency [52]. The reduction of total bile acids in the gut contributes to an overgrowth of pathobiont microbes, including members of Enterobacteriaceae and Enterobacteriaceae [52]. The elevation of secondary bile acids is largely associated with an abundance of bacterial producers of secondary bile acid, such as members of Clostridium and Eubacterium [52,56]. Bile salt hydrolase activity is associated with resistance of hepatocytes to bile toxicity and is broadly present in gut microbes including Bacteroides, Bifidobacterium, Clostridium and Lactobacillus [56].

In this study, we investigated the potential analytic and clinical validity of the gut microbiome to improve prediction of future liver disease. From baseline gut metagenomic sequencing and 15 years of EHR follow-up, we developed a framework to predict incident LD and ALD using machine learning approaches, demonstrating that the gut microbiome and conventional risk factor models exhibited similar prediction performances separately, but importantly that microbiome-augmented conventional risk factor models markedly improved prediction. These results indicate that the combination of conventional risk factors with gut microbiota may have potential clinical utility in early risk stratification for liver disease.

Few studies so far have investigated the prediction of incident liver disease events using gut microbiota. Currently, clinical risk prediction models for liver disease events are commonly derived from demographic, lifestyle and biochemical factors obtained from routine blood tests. While these prediction rules have reasonable accuracy in clinical practice, they tend to be influenced by extrahepatic conditions and have reduced accuracy for early stage disease [57,58]. Furthermore, there is a lack of guidance for primary care on the necessity of referral based on the test results, as a large number of patients with abnormal test results are asymptomatic during liver disease progression [59-61]. Thus, there is an urgent need for new tools which improve early detection of high risk individuals.

Our findings are consistent with previous studies of the relationship of bacterial taxa with hepatic function, disease and progression, and identified several taxa with potential probiotic effects. However, the precise role of gut microbiota is poorly understood and our results support the need for species level or indeed greater levels of resolution offered by even deeper metagenomic sequencing. For example, the abundance of the Bifidobacterium genus has been reported to be associated with alcoholism and liver injury in various ways [30,62]: at species level, B. dentium has been found to be enriched in advanced liver disease [25]; conversely, B. pseudocatenulatum and B. bifidum have been recognized as potential probiotics that may attenuate liver damage [33,63,64]. This indicates the importance of lower-level taxonomic resolution in interpreting how bacteria contribute to the disease pathology.

Our study has several limitations. Because a prospective early detection study must consider a large number of apparently healthy individuals, we were limited in the number of incident disease cases, and we are therefore not well-powered to investigate subtypes and stages of liver disease, which might be of greater clinical significance. The need for shallow metagenomic sequencing for a large prospective cohort also meant that we were not able to evaluate the added information of deep sequencing to risk prediction. The prevention measures available to individuals at high risk of liver disease are also somewhat limited. These include weight reduction, alcohol and smoking cessation, and may extend to caution with pharmaceutical prescriptions. Finally, our cohort is of European ancestry and therefore likely suffers from the well-known ancestry bias of analyses performed in European cohorts; thus, these prediction models are likely to have attenuated performance in non-European ancestries.

Notwithstanding the challenging necessity for validation of novel biomarkers as well as development of standards for interpretation as prerequisites for clinical implementation, our study provides an evidence base and corresponding risk prediction models for the translation of metagenomic sequencing in risk prediction of liver disease.

STUDY POPULATION
The FINRISK population surveys have been performed every 5 years since 1972 to monitor trends in cardiovascular disease risk factors in the Finnish population [19,65]. The FINRISK 2002 study was based on a stratified random sample of the population aged 25-74 years from six specific geographical areas of Finland [66]. The sampling was stratified by sex, region and 10-year age group so that each stratum had 250 participants. The overall participation rate was 65.5% (n = 8798 [...] as the weight in kilograms divided by the square of height in meters, measured with light clothing [19]. Smoking status described whether a participant was a current daily smoker at the time of the survey. Alcohol consumption, based on a self-reported questionnaire, was measured as the average weekly pure alcohol use in grams during the past 12 months. Triglycerides (TRIG), GGT, HDL and LDL-cholesterol were measured from blood samples collected from participants advised to fast for at least 4 hours prior to collection and avoid heavy meals earlier during the day [19,67,68]. EHR follow-up of incident disease was until December 31st, 2016. The median follow-up was 14.84 years and the end point was the date of death or last follow-up. Incident disease was coded as a binary variable indicating disease case (1) or non-case (0), with the matched time from baseline to event or end of follow-up also utilised for analyses.

CHARACTERIZATION OF THE GUT MICROBIOME
Stool samples were collected by participants and mailed overnight to the Finnish Institute for Health and Welfare for storing at -20°C; the samples were sequenced at the University of California San Diego in 2017. The gut microbiome was characterized by shallow shotgun metagenomics sequencing with Illumina HiSeq 4000 Systems. We successfully performed stool shotgun sequencing in n = 7231 individuals. The detailed procedures for DNA extraction, library preparation and sequence processing have been previously described [66]. Adapter and host sequences were removed. To preserve the quality of data while retaining most of the disease cases, samples with sequencing depth less than 400,000 reads were excluded from our analysis. The metagenomes were classified using default parameters in Centrifuge 1.0.4 [69], using an index database based on taxonomic definitions from the Genome Taxonomy Database (GTDB) release 89 [70,71].

The gut microbial composition was represented as relative abundance of taxa. For each metagenome at phylum, class, order, family, genus and species levels, the relative abundance of a taxon was computed as the proportion of reads assigned to the clade rooted at this taxon among total classified reads of this metagenome. The relative abundance of a taxon that had no reads assigned in a metagenome was considered as zero in the corresponding profile. We focused on common and relatively abundant taxa with a within-sample relative abundance greater than 0.01% in more than 1% of samples. The centered log-ratio (CLR) transformation was carried out on abundance data by taking the log of each taxon's abundance divided by the geometric mean of abundance in each metagenome profile. Abundance of zero was replaced with a value representing 1/10 of the minimum abundance in a metagenome before transformation. In this study, all analyses except for microbial diversity calculation were based on CLR transformed data.
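The CLR transformation with zero replacement described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the authors' implementation: the function name is hypothetical, and "minimum abundance" is assumed to mean the smallest non-zero abundance within the same metagenome.

```python
import math


def clr_transform(abundances):
    """Centered log-ratio transform of one metagenome profile.

    abundances: list of relative abundances for one sample.
    Zeros are replaced by one tenth of the smallest non-zero abundance
    in the sample (assumed interpretation), then each value is
    log-transformed and centered on the mean log, which is equivalent
    to dividing by the geometric mean before taking the log.
    """
    positive = [a for a in abundances if a > 0]
    if not positive:
        raise ValueError("profile contains no non-zero abundances")
    pseudo = min(positive) / 10.0
    logs = [math.log(a if a > 0 else pseudo) for a in abundances]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]
```

By construction the transformed values of each sample sum to zero, which removes the unit-sum constraint of relative abundances before regression or tree-based modelling.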

DISEASE CASE DEFINITIONS
The liver disease outcomes investigated in this study comprise two groups: alcoholic liver disease (ALD) and a broader category of any liver disease (LD), defined according to ICD-10 codes (Finnish modification). A sample was considered an incident case of any liver disease if the follow-up register-based diagnostic classification fell under ICD-10 codes K70-K77; alcoholic liver disease was defined by ICD-10 code K70. In the present study, disease diagnoses were followed up until the end of 2016.
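The two case definitions above amount to a simple rule on the three-character ICD-10 category. The Python sketch below shows one way to express it; the function name and the normalisation of codes (upper-casing, removing the dot) are assumptions, and the Finnish modification of ICD-10 may use code formats that this toy parser does not handle.

```python
def classify_icd10(code):
    """Map an ICD-10 code to the two outcome flags used in the study.

    Returns a dict with:
      "LD":  True if the code falls within categories K70-K77
      "ALD": True if the code falls within category K70
    """
    normalized = code.strip().upper().replace(".", "")
    category = normalized[:3]  # e.g. "K703" -> "K70"
    is_k_category = (
        len(category) == 3 and category[0] == "K" and category[1:].isdigit()
    )
    is_ld = is_k_category and 70 <= int(category[1:]) <= 77
    is_ald = category == "K70"
    return {"LD": is_ld, "ALD": is_ald}
```

Every ALD case is therefore also an LD case, which is consistent with LD being described as the broader category.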

INCLUSION AND EXCLUSION CRITERIA
The inclusion criteria of the FINRISK 2002 cohort have been previously described [19]. Samples with gut microbiome profiles, phenotype metadata and follow-up all available were included in our analysis (n=7115). The exclusion criteria of our analysis were: (1) samples with gut metagenomic sequencing yielding <400K reads; (2) presence of baseline prevalent diagnosis of target disease for prediction; (3) baseline pregnancy during the survey year. Altogether, 7005 and 6965 samples were included for modelling analyses of ALD and LD, respectively.

PREDICTION MODELING OF INCIDENT LIVER DISEASES
General framework. Prediction models were developed for any liver disease and alcoholic liver disease at phylum, class, order, family, genus and species levels separately. For each incident disease to be predicted, samples were randomly shuffled and partitioned into a training cohort for discovery and a validation cohort for evaluation at a 7:3 ratio according to the target disease variable, such that the distribution of disease cases and healthy controls in the training and testing datasets was consistent. Within the training set, we first performed pre-selection of features (detailed in the next section) and then developed models using pre-selected features through 5-fold cross validation stratified according to the prediction target, which further created random splits of internal training and testing sets at an 8:2 ratio five times, with testing sets being mutually exclusive. The models were optimized based on cross-validated results. The optimal models were then trained on the full training set and finally assessed on the withheld validation set, which was excluded from the training and optimization process to avoid data leakage. Considering the variation of attribute distributions that can occur during random data partitioning, we repeated the whole process described above 10 times and reported the average results. The detailed procedures are described in the rest of this section.
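The partitioning scheme in the general framework (a stratified 7:3 split repeated 10 times, with stratified 5-fold cross-validation inside each training set) can be sketched as follows. This is a standard-library Python illustration with hypothetical function names and toy labels; the authors worked in R and their exact sampling routine is not given in the text.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices into training and validation sets while keeping
    the case/control ratio approximately equal in both."""
    rng = random.Random(seed)
    train, valid = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        train.extend(idx[:cut])
        valid.extend(idx[cut:])
    return sorted(train), sorted(valid)


def stratified_kfold(indices, labels, k=5, seed=0):
    """Assign training indices to k mutually exclusive folds, dealing the
    members of each class round-robin so every fold sees cases."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels[i] for i in indices)):
        idx = [i for i in indices if labels[i] == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds


# Toy labels (10 cases, 90 controls); ten repeated partitions as in the text.
toy_labels = [1] * 10 + [0] * 90
partitions = [stratified_split(toy_labels, seed=s) for s in range(10)]
```

Each fold serves once as the internal test set while the other four are used for fitting, which gives the 8:2 internal ratio mentioned above; the validation indices are never passed to `stratified_kfold`.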

Pre-selection of microbial taxa. To select a set of informative microbial taxa that were individually associated with incident liver disease, we analyzed the relationship between microbial abundance and incident disease using (1) logistic regression adjusted for age and gender, (2) Cox regression for time to disease occurrence adjusted for age and gender, and (3) Spearman correlation. This feature selection step was performed only within the training datasets accounting for 70% of samples. A microbial taxon was included in further analyses if statistical significance (P<0.05) was found by any of the above three methods. After adjusting for age and gender, on average 8 phyla, 14 classes, 35 orders, 103 families, 299 genera and 406 species were associated with incident ALD at statistical significance using logistic [...]

Model development. The machine learning approach extreme gradient boosting was applied to predict the incidence of liver disease from baseline phenotype and microbial data using the XGBoost library in R. XGBoost is a distributed and optimized implementation of gradient boosting decision trees, an ensemble method of sequential and additive training of trees with regularization [72]. The prediction procedure was a two-step process, which involved developing models using microbial features alone and in combination with conventional risk factors. In the first step, the gradient boosting classifiers were trained on microbial features consisting of taxa abundance and diversity metrics at different taxonomic levels separately. In the second step, microbial features selected by the embedded feature selection of gradient boosting classifiers in the first step, together with conventional risk factors, were used to predict incident disease. The models were tuned with Bayesian optimization (mlrMBO in R) through 5-fold cross validation in the training dataset. The optimal models selected based on cross-validated results were evaluated in the withheld validation dataset as the final performance for predicting incident disease. The highly ranked and frequently selected (by more than half of the models) microbial features were considered as predictive signatures for further interpretation. Since logistic regression is one of the most widely used statistical tools for building clinical prediction models, we compared its prediction performance with gradient boosting classifiers using the same training and evaluation sets. In addition, we performed ridge regression, which is better suited to correlated microbiome features because it adds an L2 penalty term to the loss function, following consistent data partitioning strategies. The ridge regression was optimized by a fine grid search of parameters with cross-validation using the same division of folds as the gradient boosting classifier.

Benchmarking reference models with conventional methods. Currently, prediction models for liver disease are commonly built by regression methods of conventional risk factors. Therefore, reference models were built using logistic regression of commonly used liver disease predictors including age, gender, BMI (kg/m²), WHR, alcohol consumption (g), smoking status, TRIG (mmol/l), GGT (U/L), HDL and LDL cholesterol (mmol/l), as a benchmark procedure.

Model evaluation. The prediction performance of all models was evaluated in the corresponding withheld validation dataset (30% of samples) that was not used for discovery. The area under the receiver operating characteristic curve (AUROC) was used to compare the performance across models of different methods and features. The AUROC is a widely applied metric that considers the trade-offs between sensitivity and specificity at all possible thresholds for comparing the performance across various classifiers, with a baseline value of 0.5 for a random classifier. Area under the precision-recall curve (AUPRC) was provided as a complementary assessment, particularly when constructing risk models combining microbiome and conventional risk factors. AUPRC considers the trade-offs between precision (or positive predictive value) and recall (or sensitivity), with a baseline that equals the proportion of positive disease cases in all samples. Since AUPRC is more sensitive to higher ranks of the positive class, it is preferred for highly imbalanced datasets where, for example, case numbers are small relative to controls. As the entire model development process was repeated 10 times, following the 10 randomly sampled partitions of training and validation datasets, each data partitioning led to a set of optimal models developed in the corresponding training dataset. The final performance of optimal models developed from discovery data was evaluated in the corresponding validation data that were set apart in the beginning. The average results of data partitions were reported. To further assess the final prediction result, we considered the species-level microbiome models using gradient boosting classifiers, which outperformed microbiome-only models based on other taxonomic levels for both LD and ALD. In the withheld validation datasets of the various partitions, Cox regression models of conventional predictors, alone and in combination with predicted scores of microbiome-only models, were built using the time difference between baseline and follow-up disease occurrence or the end of follow-up. The Cox models were evaluated by the concordance statistic (c-statistic [...]
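The two evaluation metrics described above can be computed without any library, which may help clarify what they measure. The Python sketch below uses the pairwise (Mann-Whitney) formulation of AUROC and average precision as a step-wise estimate of AUPRC; these are common estimators but not necessarily the ones used by the authors, and the function names are assumptions.

```python
def auroc(labels, scores):
    """Probability that a randomly chosen case receives a higher score than
    a randomly chosen control (ties count as one half)."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("both classes are required")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true case, a
    step-wise estimate of the area under the precision-recall curve.
    Tied scores are ranked in input order (a simplification)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_cases = sum(labels)
    if n_cases == 0:
        raise ValueError("at least one case is required")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_cases
```

With few cases, a single control ranked above a case lowers average precision much more than AUROC, which is why the text treats AUPRC as the more informative complement for imbalanced outcomes.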

medRxiv preprint, this version posted June 25, 2020 (not certified by peer review). doi: https://doi.org/10.1101/2020.06.24.20138933. The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

[...] disease, similar trends were observed. For comparison, a conventional prediction model is shown in red. Error bars represent mean and standard deviation. Horizontal dashed lines mark the mean performance of conventional models.

[...] for gradient boosting models using species-level gut microbiome data together with conventional risk factors (blue), or a conventional risk factor model (red), predicting (a) incident any liver disease or (b) alcoholic liver disease. Area under the precision-recall curve (AUPRC) for (c) any liver disease and (d) alcoholic liver disease. Error bars represent mean and standard deviation. Horizontal dashed lines mark the mean performance of the conventional model as a reference. The bolded ROC and precision-recall curves correspond to the models whose AUROC and AUPRC are closest to the mean performance.
Figure 5. Microbial taxa predictive of liver disease. A bacterial taxonomy tree (phylum to family level) whose members at lower ranks showed predictive signal for incident liver disease. For full taxonomy, see Supplementary Figure 2.