AI-based multi-modal integration of clinical characteristics, lab tests and chest CTs improves COVID-19 outcome prediction of hospitalized patients

With severe cases accounting for 15% of hospitalized patients 1, the SARS-CoV-2 pandemic has put tremendous pressure on Intensive Care Units, and made the identification of early predictors of future severity a public health priority. We collected clinical and biological data, as well as CT scan images and radiology reports from 1,003 coronavirus-infected patients from two French hospitals. Radiologists' manual CT annotations were also available. We first identified 11 clinical variables and 3 types of radiologist-reported features significantly associated with prognosis. Next, focusing on the CT images, we trained deep learning models to automatically segment the scans and reproduce radiologists' annotations. We also built CT image-based deep learning models that predicted future severity better than models based on the radiologists' scan reports. Finally, we showed that including CT scan features alongside the clinical and biological data yielded more accurate predictions than using clinical and biological data alone. These findings show that CT scans provide insightful early predictors of future severity.
Previous studies have demonstrated that risk factors for severe evolution include demographic variables such as age, comorbidities, and biological variables measured within 2 days of patient admission [2][3][4]. Beyond clinical and biological variables, computerized tomography (CT) scans are also potential sources of information: the degree of pulmonary inflammation is associated with clinical symptoms and severity 5,6, and the extent of lung abnormality is predictive of severe disease evolution 7,8. Here we evaluated to what extent visual or AI-based analysis of CT scans at patient admission added information about future severe disease evolution once clinical and biological data had been taken into account.
[Figure 2 caption, partial] Radiologist report: critical stage of COVID-19 (stage 5). (Middle) 56-year-old man, with diffuse distribution and multiple large regions of subpleural GGO with superimposed intralobular and interlobular septal thickening (crazy paving). Estimated disease extent by AI: 51%/68% (right/left). Radiologist report: severe stage of COVID-19 (stage 4). (Bottom) 70-year-old woman, with minimal impairment, and multiple small regions of subpleural GGO with consolidation to the right lower lobe. Estimated disease extent 13%/7% (left/right). Radiologist report: moderate stage of COVID-19 (stage 2).
Coronavirus progression is evaluated by the World Health Organization on a 1 to 10 scale, with severe scores of 5 or more corresponding to an oxygen flow rate of 15 L/min or higher, or the need for mechanical ventilation, or patient death 9. We first evaluated how clinical and biological variables measured at admission were associated with future severe progression (score of 5 or more). These variables were available for 989 individuals, and we computed the severity odds ratios for each individual variable, and at each hospital center (Figure 1).
When combining association results from the two centers, we found 11 variables significantly associated with severity (P < 0.05/63 to account for testing 63 variables, Figure 1) [...] 7,15,16. We hypothesize that peripheral topography has a positive impact on prognosis because peripheral lesions could be less extended. We next trained a deep neural network called AI-segment (Supp Figure 1) to segment radiological patterns and provide automatic quantification 18,19 of their volume, expressed as a percentage of the full lung volume. These patterns included the three distinguishable features that appear as disease severity progresses 17: ground glass opacity or GGO, crazy paving, and finally consolidation. AI-segment was trained on 161 patients from KB and evaluated on 132 patients from IGR, of which 14 were fully annotated and 118 partially annotated. The mean absolute error in volume prediction for the fully annotated scans was 6.94% for GGO, 1.01% for consolidation, and 7.21% for healthy lung (no crazy paving was present in these scans). On the larger cohort of partially annotated scans, the accuracy with respect to the radiologist score was 78% for GGO, 67% for crazy paving, and 74% for consolidation (for a 1% detection threshold on the AI-segment result, Supp Table 1).

AI-segment also accurately quantified the disease extent (Supp Figure 3). AI-segment visual results were also consistent with radiologist observations (see Figure 2 for three representative cases). We lastly evaluated to what extent the AI-segment trained on CT scans provided finer information about future severity compared to radiologists' scan reports. Using predicted volumes from AI-segment, we found that GGO (OR KB 1. 8 16) were all associated with severity (accounting for multiple testing). This confirms that automatic estimation of lesion volumes can add more precise measures of future severity to the radiologists' scan reports (Supp Table 2) 8.
medRxiv preprint, this version posted May 19, 2020. doi: https://doi.org/10.1101/2020.05.14.20101972. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
We next evaluated the prognostic value of CT scans alone through three different models. The first model, called report, included variables from the radiological report only. The second was based on the automatic lesion volumes measured by AI-segment. The third, called AI-severity, used a weakly supervised approach with no radiologist-provided annotations (Supp Figure 2) 20. All three models were trained on 646 KB patients, tested on 150 KB validation patients, and validated on the independent IGR dataset of 137 patients (Figure 3). On the validation set from KB hospital, report was outperformed by AI-severity but not by AI-segment (AUC AI-severity = 0.76, AUC AI-segment = 0.68, AUC report = 0.72). On the independent IGR validation set, both AI-segment and AI-severity outperformed the model report (AUC AI-severity = 0.70, AUC AI-segment = 0.68, AUC report = 0.66). Our follow-up analyses revealed that the predictive performance of AI-severity was strong in part because the internal representation of the neural network captures clinical features from the lung CTs, such as age, on top of the known COVID-19 radiology features (see interpretability of AI-severity in Supp Material).
Lastly, we evaluated whether CT scans have prognostic value beyond what can be inferred from clinical and biological characteristics alone. We therefore compared the performance of trimodal CT scan / clinical / biological models to bimodal clinical / biological models. We compared model performances for three outcomes: our initial WHO-defined high severity outcome of "oxygen flow rate of 15 L/min or higher, or need for mechanical ventilation, or death", as well as two other outcomes studied in the literature, "death or ICU admission", and "death". We built trimodal versions of report, AI-segment, and AI-severity, adding clinical and biological information to the original CT scan-based models by implementing a greedy search approach to include optimal variables (Supp Figure 4). All three trimodal models performed consistently better than the bimodal biological/clinical model (Figure 3 and Supp Table 3), whether it be trimodal report, AI-segment, or AI-severity (mean AUC increase of 0.02-0.03). They also outperformed clinical/biological models from the literature (Colombi et al. model 7 and MIT COVID analytics model). Of note, the fact that the models trained with patients from the KB hospital performed well when evaluated on the IGR hospital is evidence of their robustness, especially since these two hospitals receive patients with very different comorbidities (85% of cancer patients at IGR and 7% at KB). Taken together, these consistent results confirm the added prognostic value of CT scans. Importantly, while trimodal AI-severity generally outperformed trimodal report across all outcomes, and trimodal AI-segment sometimes outperformed report, the AUC difference was always modest (max increase of 0.03 for AI-severity vs report, and max increase of 0.02 for AI-segment vs report), showing that the incorporation of CT-scan analyses, no matter what the method, is the strongest performance booster. Therefore, beyond AI modeling, our study shows that a composite scoring system integrating selected radiological measurements with key clinical and biological variables provides accurate predictions and can rapidly become a reference scoring approach for severity prediction.
Our retrospective study conducted on two French hospitals shows that future disease severity markers are present within routine CT scans performed at admission, and these can be identified and quantified via AI-based scoring, providing useful and interpretable elements for prognosis.

The clinical and laboratory data were obtained from detailed medical records, cleaned and formatted retrospectively by 10 radiologists with 3 to 20 years of experience (5 radiologists at IGR and 5 at KB). Data from the clinical examination include: sex, age, body weight and height, body mass index, heart rate, body temperature, oxygen saturation, blood pressure, respiratory rate, and a list of symptoms including cough, sputum, chest pain, muscle pain, abdominal pain or diarrhoea, and dyspnea. Health and medical history data include presence or absence of comorbidities (systemic hypertension, diabetes mellitus, asthma, heart disease, emphysema, immunodeficiency) and smoker status. Laboratory data include alanine, conjugated bilirubin, total bilirubin, creatine kinase, CRP, ferritin, haemoglobin, LDH, leucocytes, lymphocyte, monocyte, platelet, polynuclear neutrophil, and urea.

CT scan acquisition
Three different models of CT scanners were used: two General Electric CT scanners (Discovery CT750 HD and Optima 660; GE Medical Systems, Milwaukee, USA) and a Siemens CT scanner (Somatom Drive; Siemens Medical Solutions, Forchheim). All patients were scanned in a supine position during breath-holding at full inspiration. The acquisition and reconstruction parameters were a 120 kV tube voltage with automatic tube current modulation (100-350 mAs) and 1 mm slice thickness without interslice gap, using filtered back-projection (FBP) reconstruction (Somatom Drive) or blended FBP/iterative reconstruction (Discovery or Optima). Axial images with a slice thickness of 1 mm were used for coronal and sagittal reconstructions.
The scans were independently examined by experienced radiologists using a standard workstation in the clinical image archiving and transmission system. All radiologists were informed of the patients' clinical status (suspicion of COVID-19, clinical signs of severity).

Definition of CT Features
COVID-19 associated CT imaging features identified by radiologists were defined following the ACR recommendation 1. The term parenchymal opacification is applied to any homogeneous increase in lung density on chest CT. When this parenchymal opacification is dense enough to obscure the vessel margins, airway walls and other parenchymal structures, it is called consolidation. Ground-glass attenuation is defined as an increase in lung density not sufficient to obscure vessels, with preservation of bronchial and vascular margins. The crazy-paving pattern was defined as ground-glass opacification with associated interlobular septal thickening 2.
For 959 patients, CT imaging characteristics were evaluated and the following findings were reported: ground glass opacity (rounded / non rounded / absent), consolidation (rounded / non rounded / absent), interlobular septal thickening or "crazy paving" (present / absent), subpleural line, lymph node enlargement, pleural effusion, and pericardial effusion, according to morphological descriptors based on recommendations of the Fleischner Nomenclature Committee 2.
The CT findings were examined in terms of location, distribution, size and type. The location refers to the lobes and segments involved (lower, middle or upper). The distribution was described as peripheral (outer third of the lung), central (inner two thirds), or both central and peripheral.
The assessment of the size and extent of lung involvement was based on a visual classification of lung anatomy according to the evaluation criteria established by the French Society of Radiology (SFR) 3. The size of the lesion was assessed as the volume of the lung affected: absent / minimal (<10%) / moderate (10-25%) / extensive (25-50%) / severe (>50%) / critical (>75%). The coding absent / minimal / moderate / extensive / severe / critical was based on a quantitative variable with values of 0 / 1 / 2 / 3 / 4 / 5.
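The ordinal coding of disease extent described above can be sketched as a small function. This is only an illustration: the function name `extent_score` is hypothetical, and the handling of the exact boundary values (10%, 25%, 50%, 75%) is an assumption, since the grading criteria quoted here do not say to which class a boundary value belongs.

```python
def extent_score(percent_affected: float) -> int:
    """Map the affected lung volume (in %) to the 0-5 ordinal code.

    0 absent, 1 minimal (<10%), 2 moderate (10-25%), 3 extensive (25-50%),
    4 severe (>50%), 5 critical (>75%).
    Assumption: each upper boundary value is assigned to the lower class.
    """
    if not 0.0 <= percent_affected <= 100.0:
        raise ValueError("percent_affected must lie between 0 and 100")
    if percent_affected == 0:
        return 0  # absent
    if percent_affected < 10:
        return 1  # minimal
    if percent_affected <= 25:
        return 2  # moderate
    if percent_affected <= 50:
        return 3  # extensive
    if percent_affected <= 75:
        return 4  # severe
    return 5      # critical


# Example: AI-estimated extents, one value per lung, coded separately.
codes = [extent_score(p) for p in (51.0, 68.0)]
```

In this sketch both example values fall in the "severe" class, which matches the stage 4 grading reported for a patient with comparable extents in Figure 2.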

Automatic extraction from radiological report
Radiological features from radiological reports were automatically extracted using Optical Character Recognition and regular expression functions.

Annotation scenario of CT scans by radiologists in order to train the AI-Volumetry model
Two radiologists (4 and 9 years of experience) examined and annotated 292 anonymized chest scans independently and without access to the patients' clinical data or COVID-19 PCR results. All CT images were viewed with lung window parameters (width, 1500 HU; level, -550 HU) using the SPYD software developed by Owkin. Regions of interest were annotated by the radiologists in four distinct classes: healthy pulmonary parenchyma, ground glass opacity, consolidation, and crazy paving. One AI and imaging PhD student provided full 3D annotation of the four classes on 22 anonymized chest scans using the 3D Slicer software.
The presence of organomegaly was also recorded as a binary class. When multiple CT images were available for a single patient, the scan to analyze was selected using the SPYD software.

Models for segmentation of CT scans ( AI-segment )
In the proposed pipeline for lesion segmentation from CT scans, called AI-segment, we deployed 3 segmentation networks: 3D ResNet50 4, 2.5D U-Net, and 2D U-Net 5. These are three powerful convolutional neural networks that have achieved state-of-the-art performance in numerous medical image segmentation tasks. U-Net consists of convolution, max pooling, ReLU activation, concatenation and up-sampling layers organized in three sections: contraction, bottleneck, and expansion. ResNet contains convolution, max pooling, batch normalization, and ReLU layers that are grouped in multiple bottleneck blocks.
All models were trained on CT scans provided by Kremlin-Bicêtre (KB) and evaluated on annotated CT scans from Institut Gustave Roussy (IGR). The dataset was divided into two categories: Fully Annotated Scans (FAS), composed of 22 scans (8 from KB and 14 from IGR), and Partially Annotated Scans (PAS), composed of 292 scans (153 from KB and 118 from IGR). The 2D U-Net was trained for left/right lung segmentation while the 3D ResNet50 and 2.5D U-Net were used for lesion segmentation. The 3D ResNet50 was trained on the 8 KB FAS. We used Stochastic Gradient Descent for parameter optimization, with an initial learning rate of 0.1 and a decay factor of 0.1 every 20 epochs. The network was trained for a total of 100 epochs. For the 2.5D U-Net, the Adam optimization algorithm was used for 300 epochs, with learning rate, weight decay, gradient clipping and learning rate decay parameters set respectively to 1e-3, 1e-8, 1e-1, and 0.1 (applied at epochs 90 and 150). While the validation set remained the same as for the 3D ResNet50, the 153 KB PAS were added to the 8 KB FAS in the training set. PAS were only added to the 2.5D U-Net training set because the annotated volume in these scans is incomplete (on average 16 slices are annotated per PAS) and would not satisfy the volumetric requirements of the 3D ResNet50 input. Finally, for the left/right lung segmentation, the 2D U-Net was trained on the 8 KB FAS. As for the 2.5D U-Net, the Adam optimization algorithm was used, with learning rate, weight decay, gradient clipping and learning rate decay parameters set respectively to 1e-3, 1e-8, 1e-1, and 0.1 (applied at epoch 70), over 104 epochs. Both the 2.5D U-Net and the 2D U-Net used affine transformation and contrast change for data augmentation, while the 3D ResNet50 used affine transformation, contrast change, thin plate splines, and flipping. The 3D ResNet50 and 2.5D U-Net were trained by minimizing the cross entropy loss, and the 2D U-Net by minimizing the binary cross entropy loss. All training was performed on NVIDIA Tesla V100 GPUs using the PyTorch framework. During the validation phase, ensemble inference 6 was performed on all the available scans.
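To make the learning rate schedules described above concrete, the sketch below computes the learning rate at a given epoch for the step schedule of the 3D ResNet50 and the milestone schedule of the 2.5D U-Net. It is written in plain Python rather than with the training framework; the function names are hypothetical, and the conventions that epochs are counted from 0 and that a decay takes effect from the milestone epoch onward are assumptions.

```python
import math


def step_lr(epoch: int, base_lr: float = 0.1, gamma: float = 0.1,
            step_size: int = 20) -> float:
    """Learning rate multiplied by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


def milestone_lr(epoch: int, base_lr: float = 1e-3, gamma: float = 0.1,
                 milestones=(90, 150)) -> float:
    """Learning rate multiplied by `gamma` at each milestone epoch."""
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** n_decays


# 3D ResNet50: 100 epochs, initial rate 0.1, decay 0.1 every 20 epochs.
resnet_schedule = [step_lr(e) for e in range(100)]
# 2.5D U-Net: 300 epochs, initial rate 1e-3, decay 0.1 at epochs 90 and 150.
unet_schedule = [milestone_lr(e) for e in range(300)]
```

Under these assumptions the 3D ResNet50 ends training at a rate of about 1e-5 (four decays), and the 2.5D U-Net also ends at about 1e-5 (two decays). The 2D U-Net schedule would correspond to `milestone_lr(e, milestones=(70,))` over 104 epochs.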

Models for severity classification based on CT scans (AI-severity)
The AI-severity model is defined as an ensemble of four sub-models, as illustrated in Supp Fig 2. Each of these sub-models is designed to predict the disease severity from CT scans. Since they do not require expert annotations at the slice level, these sub-models fall within the scope of weakly supervised learning. The preprocessing of the data consisted of resizing the CT scans to 10 mm pixel spacing along the vertical axis and obtaining a segmentation of the lungs using a pre-trained U-Net algorithm 7. Each sub-model is composed of two blocks: a deep neural network called the feature extractor and a logistic regression. CT scans may contain sources of bias such as catheters (EKG monitoring, oxygenation tubing...) that are easily detectable in a CT and can bias the prediction of severity (i.e. the model predicts the presence of a technical device associated with severity instead of the radiological features associated with severity). To ensure that these biases do not affect the features, the lung segmentation mask was applied before the features were extracted. As a result, only the lungs were visible to the feature extractor.
Two of the sub-models used an EfficientNet-B0 8 pre-trained on the ImageNet public database as feature extractor, while the other two used a ResNet50 9 pre-trained with MoCo v2 10 on one million CT scan slices from both Deep Lesion 11 and LIDC 12. Each of these networks provides an embedding of the slices of the input CT scans into a lower-dimensional feature space (1280 dimensions for EfficientNet-B0 and 2048 for ResNet50 with MoCo v2). A windowing, used to select specific ranges of intensities, was also applied to the CT scans before feature extraction. For the two sub-models based on EfficientNet-B0, the image intensities were clipped to the (-1000 HU, 200 HU) and (-1000 HU, 600 HU) ranges, respectively. For one of the remaining two sub-models (based on ResNet50 with MoCo v2), the (-1350 HU, 150 HU) range was used, whereas for the last one a combination of the following ranges was used: (-1000 HU, 0 HU), (0 HU, 1000 HU) and (-1000 HU, 4000 HU). Finally, for each of these sub-models, a logistic regression (with ridge penalty) was used to predict the disease severity from the averaged features. For the ResNet50-based sub-models, a Principal Component Analysis (PCA) with 40 components was used to reduce the dimensionality of the feature space before the logistic regression was applied. All the sub-models were equally weighted in the ensemble, and the disease severity predictions of the AI-severity model were obtained by averaging the predictions of the models in the ensemble.
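Two of the steps above, intensity windowing and equal-weight ensembling, are simple enough to illustrate in a few lines. The sketch below is not the authors' implementation: the function names are hypothetical, the intensities are given as a flat list of Hounsfield units rather than a 3D volume, and rescaling the clipped values to [0, 1] is an assumption (the text only states that intensities were clipped).

```python
import math
from typing import List, Sequence


def apply_window(hu_values: Sequence[float], low: float, high: float) -> List[float]:
    """Clip intensities to [low, high] HU, then rescale to [0, 1] (assumed)."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return [(min(max(v, low), high) - low) / (high - low) for v in hu_values]


def ensemble_average(probabilities: Sequence[float]) -> float:
    """Equal-weight average of the severity probabilities of the sub-models."""
    if not probabilities:
        raise ValueError("at least one sub-model prediction is required")
    return sum(probabilities) / len(probabilities)


# Window used by one of the EfficientNet-B0 sub-models: (-1000 HU, 200 HU).
windowed = apply_window([-2000, -1000, -400, 200, 1000], low=-1000, high=200)

# Hypothetical outputs of the four sub-models for one patient.
severity = ensemble_average([0.2, 0.4, 0.6, 0.8])
```

Values below -1000 HU map to 0 and values above 200 HU map to 1, so air outside the body and dense structures such as bone no longer dominate the intensity range seen by the feature extractor.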

Interpretability of AI-severity
An interpretability study was conducted on AI-severity to better understand its performance. We analyzed the correlation between the internal representation of the sub-models (i.e. the input of the logistic regression) and radiological and clinical variables. When the output of the logistic regression was replaced by variables from the radiology reports, the AUCs on the KB validation set of 150 patients were 94.1% for disease extent (threshold >2), 71.4% for crazy paving, 67.1% for consolidation and 74.8% for GGO, showing that the feature extractors correctly captured part of the radiology signal. More interestingly, it was also possible to correlate internal representations with clinical variables such as age (AUC 85.1% with a threshold of 60 years old), sex (AUC 85.2%) or oxygen saturation (AUC 76.2%, threshold 90%). For comparison, a logistic regression trained on the radiology report variables only reached AUC scores of 70.0%, 59.9% and 67.8%, respectively. This gap shows that the internal representations of AI-severity capture clinical information directly from CT scans.

Models for multimodal integration
The models used to predict the outcome from multiple modalities are logistic regressions, trained with 5-fold cross validation on the training dataset of 646 patients from KB, stratified by age and outcome. Variables that were available for fewer than 300 patients (conjugated bilirubin and alanine) were not used. For the remaining variables, missing values were replaced by the average over patients of the training set. L2 regularization was applied to the weights of the models. The regularization coefficient was chosen by comparing the results obtained in cross validation with different values, ranging from 0.01 to 100; the value maximizing the average AUC over the 5 folds was selected. We used pandas and scikit-learn for data manipulation and model fitting 13.
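The imputation step described above has one detail worth spelling out: the replacement values are computed on the training set only and then reused unchanged on the evaluation sets. The following plain-Python sketch illustrates this; it does not reproduce the pandas and scikit-learn code used in the study, and the function names and the use of `None` to mark missing values are assumptions.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def fit_column_means(train_rows: Sequence[Row]) -> List[float]:
    """Per-variable mean over the training patients, ignoring missing values."""
    n_cols = len(train_rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in train_rows if row[j] is not None]
        if not observed:
            raise ValueError(f"variable {j} has no observed value in training")
        means.append(sum(observed) / len(observed))
    return means


def impute(rows: Sequence[Row], means: Sequence[float]) -> List[List[float]]:
    """Replace missing values with the training-set means."""
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


# Toy example with two variables; None marks a missing measurement.
train = [[1.0, None], [3.0, 4.0], [None, 8.0]]
means = fit_column_means(train)
validation_filled = impute([[None, None], [5.0, None]], means)
```

Reusing the training means on the validation rows keeps the evaluation sets from influencing any fitted quantity.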

Selection of clinical and biological variables added to the models based on CT scan variables
Clinical and biological variables were selected through a forward feature selection technique (Supp Fig 4). At baseline (left of the figure), a model was trained in cross-validation using only a fixed set of variables. Three initial sets were considered here: radiologist report, AI Lungs and AI volumetry. The variables encoded in the radiologist report include a presence/absence coding of ground glass opacity (GGO), rounded GGO, crazy paving, consolidation, rounded consolidation, peripheral topography, and inferior predominance, as well as disease extent, which is a semi-automatic assessment of the amount of lesions in the lung. The AI-Lung model includes the single output variable of the neural network model trained to predict severity, and the AI volumetry model includes the automatic quantification of the ground glass, consolidation and crazy paving patterns, and the automatic quantification of disease extent. For comparison, the procedure was also performed starting from an empty set of variables (clinical only). The added prognostic value of every clinical or biological variable was then assessed separately, by training a new model using this variable in addition to the previous set. The variable resulting in the largest AUC score was added to the selection. This procedure was repeated for 20 iterations. For every initial selection, the performance of the models increased quickly at first (left part of Supp Fig 4), then reached a plateau (right half of the figure), indicating that the variables added after the tenth iteration did not significantly increase the predictive power of the models. Thus, in every case, only the ten best clinical and biological variables were selected.
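The forward selection procedure described above can be sketched as follows. This is a generic illustration, not the code used in the study: `forward_selection` and `toy_score` are hypothetical names, and the scoring function, which in the study would be the cross-validated AUC of a logistic regression trained on the given variables, is replaced here by a toy additive score so that the example runs on its own.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def forward_selection(
    initial: Sequence[str],
    candidates: Sequence[str],
    score_fn: Callable[[List[str]], float],
    n_iterations: int = 20,
) -> Tuple[List[str], List[Tuple[Optional[str], float]]]:
    """Greedily add the candidate variable that yields the highest score."""
    selected = list(initial)
    remaining = [c for c in candidates if c not in selected]
    # Baseline score with the initial (CT-based) variables only.
    history: List[Tuple[Optional[str], float]] = [(None, score_fn(selected))]
    for _ in range(min(n_iterations, len(remaining))):
        scored = [(var, score_fn(selected + [var])) for var in remaining]
        best_var, best_score = max(scored, key=lambda pair: pair[1])
        selected.append(best_var)
        remaining.remove(best_var)
        history.append((best_var, best_score))
    return selected, history


# Toy stand-in for a cross-validated AUC (assumption, for illustration only).
TOY_GAIN = {"extent": 0.05, "age": 0.04, "ldh": 0.03, "crp": 0.01}


def toy_score(variables: List[str]) -> float:
    return 0.60 + sum(TOY_GAIN[v] for v in variables)


selected, history = forward_selection(
    initial=["extent"], candidates=["crp", "age", "ldh"],
    score_fn=toy_score, n_iterations=2,
)
```

Plotting the scores stored in `history` against the iteration index gives the kind of curve shown in Supp Fig 4, from which the plateau after the tenth added variable was read.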

Training and evaluation of models
To predict severity, models were trained on 646 patients from KB, which included the training set of AI-segment, and evaluated on two distinct evaluation sets, with 150 patients from KB and 137 patients from IGR. The prediction is performed using the logistic regression approach.
We evaluated models that predict severity using the Area Under the Curve (AUC), and differences between AUC values were tested using the DeLong test 14.
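As a reminder of what the AUC measures, it can be computed as the proportion of (severe, non-severe) patient pairs in which the severe patient receives the higher predicted score, with ties counted as one half. The sketch below implements this pairwise definition in plain Python; the function name `auc` is hypothetical, and the DeLong test used to compare AUC values is not reproduced here.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive case outranks a negative one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = severe outcome, 0 = non-severe outcome.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the toy example, three of the four (severe, non-severe) pairs are ranked correctly, giving an AUC of 0.75.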
We evaluated the segmentation model AI-segment using the mean absolute error, defined as the average, over the available fully annotated CT scans in the validation set, of the absolute value of the difference between the ground truth percentage of each lesion type (deduced from annotations) and the estimated one. We also evaluated the detection accuracy per lesion with respect to the reported radiologist scores, defined as the percentage of classes (GGO, crazy paving, consolidation) correctly predicted by AI-segment in the validation set. A given lesion type is considered present in the AI-segment result when the estimated volumetry of the lesion type, averaged over both lungs, is above a certain threshold (here, we reported results for thresholds of 1% and 2%).
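The two segmentation metrics defined above can be written out as follows for a single lesion type. This is a minimal sketch with hypothetical function names and toy values; it assumes that the predicted volumes passed to `detection_accuracy` have already been averaged over both lungs, and that "above the threshold" means strictly greater.

```python
from typing import Sequence


def mean_absolute_error(true_pct: Sequence[float], pred_pct: Sequence[float]) -> float:
    """Average absolute difference between annotated and estimated volumes (%)."""
    if len(true_pct) != len(pred_pct) or not true_pct:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(true_pct, pred_pct)) / len(true_pct)


def detection_accuracy(reported_present: Sequence[bool],
                       predicted_pct: Sequence[float],
                       threshold_pct: float = 1.0) -> float:
    """Share of scans where thresholded AI volume agrees with the report."""
    if len(reported_present) != len(predicted_pct) or not predicted_pct:
        raise ValueError("inputs must be non-empty and of equal length")
    predicted_present = [v > threshold_pct for v in predicted_pct]
    correct = sum(r == p for r, p in zip(reported_present, predicted_present))
    return correct / len(reported_present)


# Toy values for one lesion type (e.g. GGO) on a handful of scans.
mae = mean_absolute_error([10.0, 20.0, 0.0], [12.0, 15.0, 1.0])
acc = detection_accuracy([True, False, True, False], [5.0, 0.5, 0.8, 3.0],
                         threshold_pct=1.0)
```

With these toy values, the mean absolute error is 8/3 percentage points, and the detection accuracy at the 1% threshold is 0.5 because the third scan is a missed detection and the fourth a false detection.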

Benchmark models
We used the clinical and biological variables previously proposed in a multivariate risk score for severity, defined as admission to ICU or death, and retrained a logistic regression model using these variables 15. We also considered the MIT Covid Analytics calculator as a risk score for mortality (https://www.covidanalytics.io/mortality_calculator).

Supp Fig. 4: AUC as a function of the number of clinical and biological variables added to the multimodal model. The models initially include CT scan variables only; a greedy algorithm then adds clinical or biological variables iteratively. At each step of the algorithm, the variable that results in the largest increase of AUC score is added.

Table 3: AUC values for the different models on the different sets. Each model was trained on 646 patients from KB. Results are reported on the validation set from KB (150 patients) and the external validation set from IGR (137 patients), as well as on the training set using 5-fold cross validation stratified by outcome and age (CV KB).
