COVID-19 prognostic model using Bayesian networks learnt on patient data

The response to the ongoing second wave of the COVID-19 pandemic can be helped by giving medical professionals access to models learned on patient data. To achieve this, we learned a Bayesian network model to predict risk of ICU admission, death and length of hospital stay from patient history, initial vital signs, initial laboratory tests and medication. Data were obtained from patients who were admitted to an HM hospital with suspicion of COVID-19 until 24/04/2020. We excluded patients with an unconfirmed diagnosis, those admitted before the epidemic started in Madrid, those whose outcome was neither discharge nor death, and those who died within 24 hours of presentation. Relevant variables for the model were selected with help from medical professionals. We learned the model using Bayesian search as implemented in GeNIe. Of 2,307 patients in the dataset, 679 were excluded. With the remaining 1,645 patients, we learned a model that predicted death with 86.4% accuracy. Some of the initial variables were discarded because they were independent of the outcomes of interest conditioned on some of the other variables. This high redundancy might be useful to build simpler tests for the severity of COVID-19. We show how the model can be used at different stages of patient admission, even with only partial information about the patient. It can serve clinicians who want a fast second opinion or a summary of the available data from previous patients similar to the one at hand. Finally, we describe how we plan to improve the model with extra patient data and how it could be expanded to other contexts, such as an epidemiological one.


Introduction
The coronavirus disease 2019 (COVID-19) pandemic, declared by the WHO Director General at the media briefing on March 11th 2020, is caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) 1. It started in Wuhan, China, and later spread to the rest of the world, with over 84.53 million confirmed cases worldwide as of the last week of 2020 2.
Spain, with 1,958,844 confirmed cases and 51,078 confirmed deaths as of December 31st 2020 2, has been undergoing a very strong second wave of the pandemic over the fall (see Fig. 1), with a peak in November and cases on the rise again: 297 cases per 100,000 inhabitants were notified over the last 14 days 2.
Because the disease is new, its specific mechanisms and pathophysiology remain elusive, with clinical experience showing significant heterogeneity in the development of symptoms in severe acute cases 3. The analysis of data on the clinical characteristics, received treatment and outcomes of COVID-19 patients is of vital importance to reduce its mortality, target treatment to the presentation of the disease and help with triage and proper management of hospital resources. There has been previous work on finding prognostic models and clinical predictors for COVID-19 4,5,6,7, but it has been found to be flawed 8, due in part to poor reporting, no explanation of the intended use of the models and no description of the study population. There is also a lack of work of this type in Spain with, to the best of our knowledge, only one other study in collaboration with a hospital in Madrid 7.
For these reasons, in this article we present a Bayesian network model learnt using data from 1,645 patients admitted with symptoms of COVID-19 to hospitals of the HM network in Madrid during the first wave of the pandemic. This work has been done in partnership with various medical professionals, who helped with an initial analysis and reporting of the study population 10 and further posed concrete questions for the model to answer so that there is a clear use case.

Data sources
We obtained the data from the HM hospital network in Madrid, thanks to its project 'COVID DATA SAVE LIVES' 11. This anonymized clinical dataset comes from the HM hospitals' EHR system. It was released on April 25th, on request, to research groups that wanted to analyse it, provided they presented a project beforehand and said project was approved by the corresponding board of experts.
The data included patients' age, gender, past diagnoses, smoking status, admission data, initial vital signs and complementary tests performed in the Emergency Room, vitals and tests performed throughout their hospital stay, treatments received (including previous medications continued and specific treatment for COVID-19), destination at discharge (or death) and diagnoses during their stay.

Exclusion criteria
Patients were excluded from the analysis if they were admitted before the first cases were declared in Madrid (24/02/2020), had not yet reached an outcome (discharge or death) by 24/04/2020, were transferred to a different hospital for admission, were voluntarily discharged, had an unconfirmed diagnosis or were hospitalized for less than 24 hours. After exclusion there were 1,645 patients remaining. The exclusion process is explained in Figure 2.
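The exclusion criteria above can be read as a simple record filter. The following is a minimal sketch, assuming a hypothetical record layout (the field names `admitted`, `outcome`, `outcome_time` and `confirmed` are illustrative and are not the dataset's real columns); only the two dates come from the text.

```python
from datetime import datetime, timedelta

# Dates taken from the text; the record fields below are hypothetical names.
EPIDEMIC_START = datetime(2020, 2, 24)  # first cases declared in Madrid
DATA_CUTOFF = datetime(2020, 4, 25)     # outcomes up to and including 24/04/2020

def is_included(patient):
    """Return True when a record passes every exclusion criterion."""
    if patient["admitted"] < EPIDEMIC_START:
        return False
    # Transfers, voluntary discharges and open cases have no valid outcome.
    if patient["outcome"] not in ("discharge", "death"):
        return False
    if patient["outcome_time"] >= DATA_CUTOFF:
        return False
    if not patient["confirmed"]:
        return False
    # Stays shorter than 24 hours are dropped.
    return patient["outcome_time"] - patient["admitted"] >= timedelta(hours=24)

# Four made-up records, one per situation.
patients = [
    {"admitted": datetime(2020, 3, 10, 9), "outcome": "discharge",
     "outcome_time": datetime(2020, 3, 18, 12), "confirmed": True},
    {"admitted": datetime(2020, 2, 20, 9), "outcome": "discharge",
     "outcome_time": datetime(2020, 3, 1, 12), "confirmed": True},   # too early
    {"admitted": datetime(2020, 3, 15, 9), "outcome": "transfer",
     "outcome_time": datetime(2020, 3, 16, 12), "confirmed": True},  # transferred
    {"admitted": datetime(2020, 4, 1, 9), "outcome": "death",
     "outcome_time": datetime(2020, 4, 1, 20), "confirmed": True},   # under 24 h
]
cohort = [p for p in patients if is_included(p)]
```

Only the first made-up record survives the filter; the other three each violate one criterion.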

Model learning

Variable selection
Since the dataset had an enormous number of variables for each patient, we could not use all of them without risking overfitting the model to the comparatively small amount of training data we had.
Therefore, we chose to focus on only a few variables. To do so, we consulted two independent experts on what the most For these features we took the first analysis they had after arriving at the hospital or, when possible, the average of the first two if taken within 24 hours, so as to reduce the amount of missing data. Viral load and Interleukin-6 (IL-6) level were considered too. However, viral load was not available in the data and only 30 patients had been tested for IL-6 level in their first two tests, so those variables were discarded.
We also added the sixteen medications that the experts considered most important, as binary variables indicating whether or not a patient had been administered the drug. The drugs considered for the network were: four corticoids (methylprednisolone, dexamethasone, hydrocortisone and prednisone); two antivirals (ritonavir with lopinavir, and oseltamivir); seven antibiotics (amoxicillin, piperacillin, linezolid, azithromycin, ceftriaxone, meropenem and levofloxacin); and three immunomodulators (hydroxychloroquine, tocilizumab, and interferon beta 1-b).
Finally, three possible target variables were added: length of stay in the hospital, ICU admission and death, all relevant for the main objective of predicting the severity of the disease in a given patient. Table 1 shows all the predictor variables used for the model. For a complete analysis of the dataset, see previous work by our team 10. We studied the association of the variables with mortality and used propensity-score matching to find the effect of treatments on patient outcomes.
All lab tests except LDH, D-Dimer and lymphocyte count were removed from the network in an effort to improve readability by reducing the number of redundant variables. The same happened with initial blood pressure and the other vital signs. These variables were removed because, in every network learned, they were found to be independent of mortality and ICU admission given the other initial variables. Independence was checked with the d-separation graphical criterion 12.
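To make the d-separation check concrete, here is a minimal sketch of the standard test based on the moralized ancestral graph. The toy graph is hypothetical and is not the learned network; the function names are illustrative and this is not how GeNIe implements the check.

```python
from itertools import combinations

def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen

def d_separated(parents, xs, ys, zs):
    """True if every path between xs and ys is blocked given zs."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(parents, xs | ys | zs)
    # Moralize the ancestral graph: link each child to its parents
    # and link every pair of parents that share a child.
    adj = {n: set() for n in keep}
    for child in keep:
        ps = list(parents.get(child, []))
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for a, b in combinations(ps, 2):
            adj[a].add(b)
            adj[b].add(a)
    # Search from xs without passing through the conditioning set.
    frontier = [x for x in xs if x not in zs]
    seen = set(frontier)
    while frontier:
        node = frontier.pop()
        if node in ys:
            return False
        for nxt in adj[node]:
            if nxt not in zs and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return True

# Hypothetical toy graph (child -> parents), NOT the network from the paper.
toy = {"ldh": ["blood_pressure"], "death": ["ldh", "age"]}
redundant = d_separated(toy, {"blood_pressure"}, {"death"}, {"ldh"})
```

In this toy graph, blood pressure only influences death through LDH, so once LDH is observed it carries no further information about death. That is the kind of redundancy that justifies dropping a variable.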

Missing data
All variables except for the initial laboratory tests and initial vital signs were complete. The incomplete ones were imputed using an iterative imputation procedure that learns a multivariate regression of one variable on all the others, then uses the newly imputed variable to learn a new regression for each of the other variables 13. This process was repeated until the imputed dataset was stable.
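The iterative procedure can be sketched as follows. This is a minimal illustration, assuming ordinary least-squares regression and a mean-value initialization; the text does not state which regressor or stopping tolerance was used, so `fit_linear`, `max_iter` and `tol` are assumptions.

```python
def fit_linear(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting (assumes a non-singular system).
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def without(row, j):
    """Copy of a row with column j removed."""
    return row[:j] + row[j + 1:]

def iterative_impute(data, max_iter=50, tol=1e-6):
    """Fill None entries by regressing each incomplete column on the others."""
    n, m = len(data), len(data[0])
    missing = [[v is None for v in row] for row in data]
    filled = [row[:] for row in data]
    # Start from column means so that every regression has complete inputs.
    for j in range(m):
        observed = [data[i][j] for i in range(n) if not missing[i][j]]
        mean = sum(observed) / len(observed)
        for i in range(n):
            if missing[i][j]:
                filled[i][j] = mean
    for _ in range(max_iter):
        change = 0.0
        for j in range(m):
            if not any(missing[i][j] for i in range(n)):
                continue
            train = [i for i in range(n) if not missing[i][j]]
            coef = fit_linear([without(filled[i], j) for i in train],
                              [filled[i][j] for i in train])
            for i in range(n):
                if missing[i][j]:
                    pred = coef[0] + sum(c * v for c, v in
                                         zip(coef[1:], without(filled[i], j)))
                    change = max(change, abs(pred - filled[i][j]))
                    filled[i][j] = pred
        if change < tol:  # imputed values have stabilized
            break
    return filled

# Made-up data in which the second column equals 2 * first + 1.
completed = iterative_impute([[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [4.0, None]])
```

On this made-up example the missing entry is recovered from the linear relation between the two columns. On real clinical data the regressions would only approximate the missing values.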

Bayesian networks
A Bayesian network (BN) is a probabilistic graphical model that combines probability and graph theory to efficiently represent the probability distribution of a group of variables 12. BNs model probabilistic conditional dependencies and independencies between the variables in terms of a directed acyclic graph and a series of conditional probability distributions (CPDs). Each of the nodes in the graph represents a variable, with the edges representing conditional (in)dependencies between the variables. Each of the CPDs is associated with a variable $X_i$ and gives the probability distribution of that variable conditioned on its parents in the graph, that is, the nodes that have edges directed towards $X_i$, which we denote $\mathrm{Pa}(X_i)$. The formula for the joint probability distribution of the variables given all the CPDs is:
$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))$$
Bayesian search scores each candidate structure $G$ against the data $D$ with the BDeu score, which follows from Bayes' rule:
$$P(G \mid D) \propto P(G)\, P(D \mid G)$$
where $P(G \mid D)$ is the probability of the structure given the data, $P(G)$ is the prior distribution over the structures and $P(D \mid G)$ is the probability of the data given the structure (the marginal likelihood). BDeu assumes a uniform prior, so that maximizing $P(G \mid D)$ is the same as maximizing $P(D \mid G)$, for which there is a closed formula. The assumption of a uniform prior is reasonable since it is mostly uninformative, so no structures are preferred over others initially.
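As a worked illustration of how the joint distribution factorizes into CPDs, the sketch below builds a three-node chain with binary variables. The structure and every probability in it are made up for the example and do not come from the learned model.

```python
from itertools import product

# Hypothetical chain: ldh_high -> icu -> death (child -> list of parents).
parents = {"ldh_high": [], "icu": ["ldh_high"], "death": ["icu"]}

# P(variable = True | parent values); all numbers are illustrative.
p_true = {
    "ldh_high": {(): 0.30},
    "icu": {(True,): 0.25, (False,): 0.05},
    "death": {(True,): 0.40, (False,): 0.10},
}

def joint(assignment):
    """Probability of a full assignment as the product of one CPD entry per node."""
    prob = 1.0
    for var, pa in parents.items():
        p = p_true[var][tuple(assignment[q] for q in pa)]
        prob *= p if assignment[var] else 1.0 - p
    return prob

# The factorized joint must sum to one over all 2**3 assignments.
total = sum(joint(dict(zip(parents, values)))
            for values in product([True, False], repeat=len(parents)))
all_true = joint({"ldh_high": True, "icu": True, "death": True})  # 0.30 * 0.25 * 0.40
```

The chain needs only five numbers instead of the seven required for an unrestricted joint table over three binary variables, which is the saving that makes BNs practical with many variables.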
With this score, the algorithm starts with a random structure and performs random edge additions, removals and reversals, checking whether the score increases. If the score increases, the change is accepted. If the score does not change or decreases, the change is rejected. If no change increases the score, the algorithm stops and returns the current network, which is the highest-scoring one found from that starting point. Then, the algorithm repeats the process with another random starting structure for a number of iterations decided by the user.
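The search loop described above can be sketched as a hill climber with random restarts over a BDeu score. This is an illustrative sketch, not GeNIe's implementation: the equivalent sample size `ess`, the random-start density `p`, the number of restarts and all function names are assumptions, and the data are synthetic.

```python
import math
import random
from collections import Counter

def local_bdeu(rows, child, parents, arity, ess=1.0):
    """Log BDeu contribution of one node given its parent set."""
    q = math.prod(arity[p] for p in parents)  # number of parent configurations
    r = arity[child]
    n_ijk = Counter((tuple(row[p] for p in parents), row[child]) for row in rows)
    n_ij = Counter()
    for (config, _), count in n_ijk.items():
        n_ij[config] += count
    # Unobserved configurations contribute zero, so only observed counts are summed.
    score = sum(math.lgamma(ess / q) - math.lgamma(ess / q + c) for c in n_ij.values())
    score += sum(math.lgamma(ess / (r * q) + c) - math.lgamma(ess / (r * q))
                 for c in n_ijk.values())
    return score

def bdeu(rows, graph, arity):
    return sum(local_bdeu(rows, v, sorted(ps), arity) for v, ps in graph.items())

def is_acyclic(graph):
    state = {}  # 1 = on the current path, 2 = fully explored
    def visit(v):
        if state.get(v) == 2:
            return True
        if state.get(v) == 1:
            return False
        state[v] = 1
        ok = all(visit(p) for p in graph[v])
        state[v] = 2
        return ok
    return all(visit(v) for v in graph)

def neighbours(graph):
    """Yield every graph one edge addition, removal or reversal away."""
    for a in graph:
        for b in graph:
            if a == b:
                continue
            cand = {v: set(ps) for v, ps in graph.items()}
            if a in graph[b]:
                cand[b].discard(a)   # remove a -> b
                yield cand
                rev = {v: set(ps) for v, ps in cand.items()}
                rev[a].add(b)        # reverse it into b -> a
                yield rev
            elif b not in graph[a]:
                cand[b].add(a)       # add a -> b
                yield cand

def hill_climb(rows, arity, rng, start):
    current, best = start, bdeu(rows, start, arity)
    improved = True
    while improved:
        improved = False
        candidates = [c for c in neighbours(current) if is_acyclic(c)]
        rng.shuffle(candidates)      # try the possible changes in random order
        for cand in candidates:
            score = bdeu(rows, cand, arity)
            if score > best + 1e-9:  # accept only strict improvements
                current, best, improved = cand, score, True
                break
    return current, best

def random_dag(nodes, rng, p=0.3):
    order = list(nodes)
    rng.shuffle(order)
    return {v: {u for u in order[:i] if rng.random() < p} for i, v in enumerate(order)}

def bayesian_search(rows, arity, restarts=5, seed=0):
    rng = random.Random(seed)
    runs = [hill_climb(rows, arity, rng, random_dag(arity, rng)) for _ in range(restarts)]
    return max(runs, key=lambda run: run[1])

# Synthetic data: y copies x, z is independent noise.
data_rng = random.Random(1)
rows = []
for _ in range(60):
    x = data_rng.randint(0, 1)
    rows.append({"x": x, "y": x, "z": data_rng.randint(0, 1)})
arity = {"x": 2, "y": 2, "z": 2}
best_graph, best_score = bayesian_search(rows, arity)
```

The returned graph maps each node to its set of parents. Because BDeu scores equivalent structures equally, the edge between `x` and `y` may come out in either direction.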
Finally, the networks resulting from each iteration are compared and the best one is returned as the result. The main change in our process was that the comparison between models after each iteration was not done on the basis of the BDeu score but on how well they predicted patient mortality, since it is the most important of the three variables we want to predict. This accuracy was tested through leave-one-out cross-validation. This was done in an effort to get the best possible model for assessing the severity of each patient.
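Leave-one-out cross-validation can be sketched generically as below. The `fit` and `predict` callables stand in for learning the network parameters and querying the mortality node; the lookup classifier and the eight records used here are hypothetical placeholders, not the model or data from the paper.

```python
from collections import Counter, defaultdict

def leave_one_out_accuracy(xs, ys, fit, predict):
    """Train on all records but one, test on the held-out record, and average."""
    hits = 0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        hits += predict(model, xs[i]) == ys[i]
    return hits / len(xs)

# Placeholder classifier: majority outcome for each value of a single feature.
def fit_lookup(xs, ys):
    table = defaultdict(Counter)
    for x, y in zip(xs, ys):
        table[x][y] += 1
    default = Counter(ys).most_common(1)[0][0]
    return {x: c.most_common(1)[0][0] for x, c in table.items()}, default

def predict_lookup(model, x):
    table, default = model
    return table.get(x, default)

# Made-up records: LDH level and whether the patient died (1) or not (0).
ldh = ["high", "high", "high", "high", "low", "low", "low", "low"]
died = [1, 1, 1, 0, 0, 0, 0, 1]
loo_accuracy = leave_one_out_accuracy(ldh, died, fit_lookup, predict_lookup)
```

Each record is predicted by a model that never saw it, so the resulting accuracy is an estimate of performance on new patients rather than a measure of fit to the training data.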

Model
The BN model structure is shown in Figure 3. In this state, without any evidence, it serves as a summary of the distribution of the original data. The three variables of interest for prediction are the length of stay, admission to ICU and death. Tables 2a and 2b show accuracy and area under the ROC curve for each of the targets.
Accuracies for the length of stay seem to indicate that the model is generally overestimating the severity of the cases.
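For reference, the two reported metrics can be computed from predicted probabilities as in the sketch below. The 0.5 decision threshold and the example values are assumptions for illustration; they are not results from the model.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of cases where the thresholded prediction matches the outcome."""
    return sum((p >= threshold) == bool(t) for t, p in zip(y_true, y_prob)) / len(y_true)

def roc_auc(y_true, y_prob):
    """Probability that a random positive case is scored above a random negative one."""
    pos = [p for t, p in zip(y_true, y_prob) if t]
    neg = [p for t, p in zip(y_true, y_prob) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up outcomes and predicted probabilities of death.
outcomes = [1, 1, 0, 0, 0]
predicted = [0.9, 0.4, 0.6, 0.2, 0.1]
acc = accuracy(outcomes, predicted)
auc = roc_auc(outcomes, predicted)
```

Accuracy depends on the chosen threshold, while the area under the ROC curve summarizes ranking quality across all thresholds, which is why the two are reported side by side.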

Use case explanation
The model can be used for prognosis at the clinical level, giving doctors a quick second opinion that summarizes the available data. As an example, Figures 4 to 6 show how the model could be used for a concrete case as more information on the severity of the case becomes available: from admission (Fig. 4), with only demographic and triage data; followed by adding the initial laboratory tests and past conditions (Fig. 5); and, finally, the medication that the patient is receiving (Fig. 6).
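Updating the prediction as evidence arrives can be sketched with exact inference by enumeration on a tiny network. The structure, variable names and probabilities below are hypothetical and far simpler than the model in Figure 3; a tool such as GeNIe uses more efficient inference algorithms.

```python
from itertools import product

# Hypothetical network: age_over_70 -> death <- ldh_high (child -> parents).
parents = {"age_over_70": [], "ldh_high": [], "death": ["age_over_70", "ldh_high"]}

# P(variable = True | parent values); all numbers are illustrative.
p_true = {
    "age_over_70": {(): 0.35},
    "ldh_high": {(): 0.30},
    "death": {(True, True): 0.45, (True, False): 0.20,
              (False, True): 0.10, (False, False): 0.02},
}

def joint(assignment):
    prob = 1.0
    for var, pa in parents.items():
        p = p_true[var][tuple(assignment[q] for q in pa)]
        prob *= p if assignment[var] else 1.0 - p
    return prob

def posterior(target, evidence):
    """P(target = True | evidence), summing over every unobserved variable."""
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(parents)):
        assignment = dict(zip(parents, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint(assignment)
        denominator += p
        if assignment[target]:
            numerator += p
    return numerator / denominator

prior_risk = posterior("death", {})                          # no information yet
at_admission = posterior("death", {"age_over_70": True})     # demographics only
after_labs = posterior("death", {"age_over_70": True, "ldh_high": True})
```

Variables that have not been observed yet are simply summed out, so the same model answers queries at admission, after the first laboratory tests, and at any stage in between.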

Conclusions
We have presented a Bayesian network model learnt using data from patients diagnosed with COVID-19 during the first wave of the pandemic in Spain. Given the current, rapidly worsening situation, we believe that models such as this one can help clinicians reach conclusions informed by previous patient data much faster, which would ease their work, especially if hospital saturation increases.
The model is not a replacement for medical professionals but a tool for informed decision making that will hopefully be useful. Our current goal is to gather as much data as we can so that the model can be improved, and then to work on building a browser-based app to make it as accessible as possible.
Finally, this model could be applied in an epidemiological setting by using demographic data to predict the incidence of ICU admission and the time spent in the hospital, provided we have an estimate of how much of the population will be infected and how many will need hospitalization. This would help with better resource allocation.