Identifying Sequential Complication and Mortality Patterns in Diabetes Mellitus: Comparisons of Machine Learning Methodologies

Background: Diabetes mellitus-related complications adversely affect quality of life. Better risk-stratified care through mining of sequential complication patterns is needed to enable early detection and prevention. Methods: Univariable and multivariable logistic regression were used to identify significant variables that can predict mortality. A sequence analysis method termed PrefixSpan was applied to identify the most common couple, triple, quadruple, quintuple and sextuple sequential complication patterns in the directed comorbidity pathology network. A knowledge enhanced CPT+ (KCPT+) sequence prediction model was developed to predict the next possible outcome along the progression trajectories of diabetes-related complications. Findings: A total of 14,144 diabetic patients (51% males) were included. Acute myocardial infarction (AMI) without known ischaemic heart disease (IHD) (odds ratio [OR]: 2.8, 95% CI: [2.3, 3.4]), peripheral vascular disease (OR: 2.3, 95% CI: [1.9, 2.8]), dementia (OR: 2.1, 95% CI: [1.8, 2.4]), and IHD with AMI (OR: 2.4, 95% CI: [2.1, 2.6]) were the most important multivariable predictors of mortality. KCPT+ showed high accuracy in predicting mortality (F1 score 0.90, AUC 0.88), osteoporosis (F1 score 0.86, AUC 0.82), ophthalmological complications (F1 score 0.82, AUC 0.82), IHD with AMI (F1 score 0.81, AUC 0.85) and neurological complications (F1 score 0.81, AUC 0.83) given a particular prior complication sequence. Interpretation: Sequence analysis efficiently identifies the most common pattern characteristics of disease-related complications. The proposed sequence prediction model is accurate and enables clinicians to diagnose the next complication earlier, provide better risk-stratified care, and devise efficient treatment strategies for diabetes mellitus patients.


Introduction
Continuous variables were presented as median (95% confidence interval [CI] or interquartile range [IQR]) and categorical variables were presented as count (%). The χ² test with Yates' correction was used for 2×2 contingency data, and Pearson's χ² test was used for contingency data for variables with more than two categories. The Mann-Whitney U test was used to compare continuous variables. Differences between groups were tested using the Kruskal-Wallis one-way analysis of variance by ranks. For each category of complication, we compared the age of onset and the difference between male and female groups. A two-sided P value of less than 0.05 was considered statistically significant. PrefixSpan [15] was used to extract the sequential patterns of complications. To identify the significant complication factors associated with mortality in these diabetes mellitus patients, univariable logistic regression was used to estimate odds ratios (ORs) and 95% CIs. To avoid overfitting, only variables that were significant in univariable analysis were entered into the multivariable analysis. Statistical analyses were performed using RStudio software (Version: 1.1.456) and Python (Version: 3.6). Experiments were run on a 15-inch MacBook Pro with a 2.2 GHz Intel Core i7 processor and 16 GB RAM.
Betweenness centrality quantifies how much a given node is in between others; it is measured by the number of shortest paths (between any pair of nodes in the graph) that pass through the target node, normalized by the total number of shortest paths existing between any pair of nodes of the network. Eccentricity centrality is a measure of the centrality of a node in a network based on having a small maximum distance from the node to every other reachable node (i.e. the graph eccentricities). Hub measures are also used to indicate node importance in the network.
PageRank measures the transitive influence or connectivity of nodes, and its main difference from eigen centrality is that it accounts for link direction. The concept of the eigenvector centrality of a node is that the centrality index is determined not only by its position in the network but also by the neighboring nodes. Clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. We interpreted the network properties of the complication outcomes in order to detect their roles in the sequential pathology network.
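As an illustration of the degree and PageRank measures described above, the following is a minimal sketch in plain Python. The toy edge list, the function names, the damping factor of 0.85, and the convention of reporting average degree as edges divided by nodes are assumptions made for this example; they are not taken from the study's own network or software.

```python
from collections import defaultdict

def degree_table(edges):
    """Return {node: (in_degree, out_degree)} for a list of directed edges."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    nodes = set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    return {n: (in_deg[n], out_deg[n]) for n in nodes}

def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank; rank held by sink nodes is spread uniformly."""
    nodes = sorted({n for edge in edges for n in edge})
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not successors[n])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in successors[src]:
                new_rank[dst] += damping * rank[src] / len(successors[src])
        rank = new_rank
    return rank

# Hypothetical toy pathology network; mortality is a sink (no outgoing edges).
toy_edges = [
    ("renal", "ophthalmological"), ("ophthalmological", "renal"),
    ("renal", "heart failure"), ("ophthalmological", "heart failure"),
    ("renal", "mortality"), ("ophthalmological", "mortality"),
    ("heart failure", "mortality"),
]
degrees = degree_table(toy_edges)
ranks = pagerank(toy_edges)
average_degree = len(toy_edges) / len(degrees)  # assumed convention: edges / nodes
print(degrees["mortality"], max(ranks, key=ranks.get), average_degree)
```

In this toy graph the sink node collects rank from every other node, which is the sense in which PageRank accounts for link direction.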

Development of an accurate sequence prediction model
One of the important and meaningful tasks for complication sequence analysis in diabetes mellitus is to predict the next possible complication outcome of a patient based on his/her previous complications. In this study, we developed a sequence prediction model that can accurately predict the next complication outcome (or mortality) of a patient. Compact prediction tree plus (CPT+) has been proposed as a fairly new probabilistic predictive model to assist sequential pattern analysis [24,25]. We developed a knowledge enhanced CPT+ (KCPT+) model which further improves overall prediction ability by considering the previously known onset probability of couple, triple, and quadruple sequences, while retaining the advantage of CPT+ of capturing subsequence similarities without information loss.
Specifically, as the modeling contribution, we first conducted a preliminary sequence analysis and identified the onset probabilities of couple, triple, and quadruple complication sequences in the diabetes mellitus dataset, which provides a broad prior understanding of the more frequently occurring complication sequences. We then incorporated these prior sequence onset estimates into the optimization process of the CPT+ model, to increase the probability of generating the next complication outcome if it is contained in a sequence (couple, triple, quadruple, quintuple, or sextuple) that is known to occur more frequently. In contrast, the predicted probability of a complication outcome is decreased if it is in a sequence that has a low onset probability.
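The reweighting idea described above can be sketched as follows. This is not the authors' KCPT+ implementation: the voting step is a simplified stand-in for CPT+, only couple priors are used, and the weighting rule (prior divided by the mean prior, with unseen couples given the smallest observed prior) is an assumption chosen for illustration. All function names and the toy sequences are hypothetical.

```python
from collections import Counter

def base_votes(training, prefix):
    """Simplified CPT-style vote: count items that occur after the prefix
    items in training sequences containing every prefix item.
    Assumes a non-empty prefix."""
    wanted = set(prefix)
    votes = Counter()
    for seq in training:
        if not wanted.issubset(seq):
            continue  # only sequences sharing all prefix items count as similar
        last = max(i for i, item in enumerate(seq) if item in wanted)
        for item in seq[last + 1:]:
            votes[item] += 1
    return votes

def couple_priors(training):
    """Onset probability of each ordered couple (a directly followed by b)."""
    counts = Counter((a, b) for seq in training for a, b in zip(seq, seq[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def predict_next(training, prefix):
    """Scale each candidate's votes by the prior of (last prefix item, candidate)."""
    votes = base_votes(training, prefix)
    priors = couple_priors(training)
    mean_prior = sum(priors.values()) / len(priors)
    floor = min(priors.values())  # assumed weight for couples never observed
    adjusted = {
        item: count * priors.get((prefix[-1], item), floor) / mean_prior
        for item, count in votes.items()
    }
    best = max(adjusted, key=adjusted.get) if adjusted else None
    return best, adjusted

# Hypothetical toy training sequences, ordered by onset age.
toy_training = [
    ["renal", "ophthalmological", "mortality"],
    ["renal", "ophthalmological", "mortality"],
    ["renal", "heart failure", "mortality"],
    ["neurological", "renal", "ophthalmological"],
    ["renal", "ophthalmological", "heart failure"],
]
best, adjusted = predict_next(toy_training, ["renal"])
print(best, adjusted)
```

Candidates that complete a frequently observed couple receive a weight above 1 and are boosted, while candidates that complete a rare couple are scaled down, which mirrors the increase and decrease of predicted probability described in the text.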
Most patients with diabetes mellitus had multiple complications throughout their lifetime. Model training and testing treat mortality and the other complications as primary outcomes, to be predicted from the sequence of complications that preceded the onset of the outcome. For instance, the model can be used to identify patients who may suffer the most severe outcome (i.e., mortality) and require immediate medical assistance. The model can also be applied to predict other complication outcomes from previous complication sequences. In this way, the model can predict the next outcome based on any given previously experienced complication of a patient with diabetes mellitus.

Performance evaluation
To evaluate the model's performance in predicting the outcomes of sequences, we use the evaluation metrics of accuracy (ratio of true predictions over all sequence predictions), precision, sensitivity/recall, F1 score (defined below), Matthews correlation coefficient (MCC), and area under the curve (AUC) of the receiver operating characteristic (ROC) curve.
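The metrics listed above have standard definitions in terms of true/false positives and negatives. The sketch below computes them for a binary outcome; the function names and the example counts are illustrative assumptions, and the AUC is computed through its rank-based (Mann-Whitney) formulation rather than by tracing the ROC curve.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

def roc_auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one
    (ties count as one half). Assumes both classes are present."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts and scores, for illustration only.
metrics = binary_metrics(tp=80, fp=20, tn=70, fn=30)
auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(metrics, auc)
```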

Cohort characteristics
This study included a total of 14,144 diabetic patients (51% males). The descriptive statistics of complication onsets at different age intervals, stratified by gender, are shown in  .
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Significant complications to predict mortality
In univariable analysis with mortality as prediction outcome (

Sequential complication patterns
We provide an illustrative explanation of the basic concept of the proposed KCPT+ in Figure 2, in which the sequence weighting scheme aims to discriminate between the onset probabilities of common and uncommon complication sequences. In this way, the model can accurately predict the next outcome of a given sequence in a scalable way by considering prior knowledge about sequence onset probability and preserving the lossless property of CPT+ for capturing subsequence similarities.
We extracted the sequential complication patterns of the 14,144 patients with the PrefixSpan approach, which identifies the couple, triple, quadruple, quintuple, and sextuple sequential patterns according to the onset age of complications. The trajectories of complications are shown in Figure 3, which provides an easy-to-understand graphical representation of the sequential complication patterns in diabetes mellitus. A wider line indicates that more patients experienced that directed pairwise complication sequence, with the total number of patients marked on the corresponding sequence edge. A Sankey diagram visualizes the proportional flow between complications within the pathology network. The Sankey network is used to illustrate the pathological development of diabetes mellitus complication patterns (Figure 4) with the corresponding number of patients who experienced that complication development (wider grey lines indicate more patients).
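For readers unfamiliar with PrefixSpan, the following is a minimal sketch of the algorithm for sequences of single items, assuming each patient is represented as a list of complications ordered by onset age. The function name, the `min_support` and `max_length` parameters, and the toy sequences are assumptions for this example and do not reproduce the study's data or settings. Patterns are ordered subsequences, so the items of a pattern need not be adjacent in a patient's record.

```python
from collections import defaultdict

def prefixspan(sequences, min_support, max_length=6):
    """Return {pattern tuple: number of sequences containing it in order}."""
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence index, start of the remaining suffix),
        # with at most one entry per sequence.
        occurrences = defaultdict(list)
        for seq_id, start in projected:
            seq = sequences[seq_id]
            seen = set()
            for pos in range(start, len(seq)):
                if seq[pos] not in seen:  # keep only the first occurrence
                    seen.add(seq[pos])
                    occurrences[seq[pos]].append((seq_id, pos + 1))
        for item, new_projected in occurrences.items():
            if len(new_projected) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(new_projected)
                if len(pattern) < max_length:
                    mine(pattern, new_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results

# Hypothetical toy cohort of four patients.
toy_sequences = [
    ["neurological", "ophthalmological", "renal", "mortality"],
    ["ophthalmological", "renal", "heart failure", "mortality"],
    ["ophthalmological", "renal", "mortality"],
    ["renal", "mortality"],
]
patterns = prefixspan(toy_sequences, min_support=3)
for pattern, support in sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0])):
    print(" -> ".join(pattern), support)
```

Patterns of length two, three, and so on up to `max_length` correspond to the couple, triple, and longer sequences discussed in the text.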
(1) Couple sequences
The top 20 most frequent couple sequences are shown in Table 4. A total of 8,491 patients died during the study period, and all had at least one complication. Among the couple sequences with mortality as the destination, renal complication was the commonest (n=3467), followed by ophthalmological complication (n=2231), ischemic heart disease with AMI (n=1849), heart failure (n=1821), ischemic heart disease without AMI (n=1692), atrial fibrillation (n=1289), neurological complications (n=1068), ischemic stroke (n=892), AMI without known IHD (n=563) and peripheral vascular disease (n=545).
(2) Triple sequences
Neurological -> ischemic stroke -> dementia is the most common triple sequence (n=930) in the cohort (

(3) Quadruple sequences
We also identified quadruple complication sequence patterns of the diabetes mellitus patients, as shown in Table 6. The most frequent sequence was dementia -> IHD with AMI -> heart failure -> mortality (n=243), followed by ophthalmological -> renal -> heart failure -> mortality (n=131), neurological -> ophthalmological -> renal -> mortality (n=119), ophthalmological -> renal -> IHD with AMI -> mortality (n=100), and renal -> IHD with AMI -> heart failure -> mortality (n=87).
(4) Quintuple and sextuple sequences
The ten most common quintuple and sextuple sequence patterns are shown in Tables 7 and 8.

Properties of the disease complication network
We conducted a disease complication network analysis and calculated the statistical properties of the network (Table 9). In terms of degree connection in the directed pathology network, renal complications, ophthalmological complications, atrial fibrillation, neurological complications, ischemic stroke, and dementia have the largest values of in-degree (all with 12) and out-degree (all with 23), followed by heart failure and peripheral vascular disease, both with in-degree (11) and out-degree (22), implying their important 'intermediate' role in the network.
However, mortality as the destination has the largest out-degree value (12). The average degree of the complication network is 10.462.
Several centrality measures were calculated. Firstly, closeness centrality was the largest for renal and ophthalmological complications, AF, neurological complications, ischemic stroke, and dementia (all equal to 1.00), followed by HF and peripheral vascular disease (both equal to 0.9), implying their closeness importance in the network. This is further confirmed by harmonic closeness centrality, which gives almost the same results. Secondly, eccentricity centrality can be interpreted as the ease with which a complication can be reached by all other complications in the network. IHD with AMI, heart failure, IHD without AMI, AMI without known IHD, and peripheral vascular disease have the largest eccentricity value (all equal to 2), indicating that these complications are more easily reachable in the pathology network. Thirdly, betweenness centrality was largest for renal and ophthalmological complications, AF, neurological complications, ischemic stroke, and dementia (all equal to 0.89), followed by HF and peripheral vascular disease (both with 0.67), implying that they can easily reach others on relatively short paths and lie on considerable fractions of the shortest paths connecting others. The ranking by eigenvector centrality is almost the same as that by betweenness centrality, except that ophthalmological complications also rank the highest, with an eigenvector centrality value of 0.91.
Finally, PageRank value was the highest for mortality (0.09), followed by renal and ophthalmological complications, HF, AF, neurological, ischemic stroke, dementia, and peripheral vascular disease (all equal to 0.08). The clustering coefficient can be used to detect whether complications tend to create tightly knit groups characterized by a relatively high density of connections. Mortality has the largest clustering coefficient (0.94), followed by IHD with or without AMI, and AMI without known IHD (all equal to 0.88). This implies that they tend to form a clique with other neighbor complications in the pathology of diabetes mellitus. The average clustering coefficient of the comorbidity pathology network is 0.87.
The identified sequential patterns provide evidence for characterizing the development of diabetes mellitus complications and show promising clinical value for optimizing diabetes mellitus treatment and potentially reducing overall mortality.

Sequence prediction results
The proposed KCPT+ sequence prediction model was employed to predict the next possible outcome of patients with diabetes mellitus. The dataset of 14,144 patients was randomly split, in a five-fold cross-validation manner, into a training dataset (80%, 11,315 patients) and a validation dataset (20%, 2,829 patients). We trained all sequence prediction models and then compared their prediction performance on the validation dataset (Table 10). In addition, we applied the KCPT+ model to separate sequence outcome datasets to predict the primary outcomes with previous complication sequences as input. The results (Table 11) show that the model performs best in predicting mortality (F1 score 0.90, AUC 0.88), osteoporosis (F1 score 0.86, AUC 0.82), ophthalmological complication (F1 score 0.82, AUC 0.82), IHD with AMI (F1 score 0.81, AUC 0.85), and neurological complication (F1 score 0.81, AUC 0.83). The experimental results demonstrate that the proposed model can efficiently predict primary sequence outcomes of diabetes mellitus patients with high accuracy. The model shows potential for the early diagnosis of possible complications and mortality based on patients' previous disease sequences, as the core module of medical decision support systems for healthcare use.
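The five-fold split described above can be sketched as follows. The shuffling seed, the function name, and the way indices are dealt into folds are assumptions for this example; the study does not state how its folds were generated. With 14,144 patients, four folds contain 2,829 patients and one contains 2,828, which is consistent with the 11,315/2,829 split reported.

```python
import random

def five_fold_splits(n_patients, n_folds=5, seed=0):
    """Yield (training indices, validation indices) for each fold."""
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary choice
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        validation = folds[k]
        training = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield training, validation

splits = list(five_fold_splits(14144))
for training, validation in splits:
    print(len(training), len(validation))
```

Each patient appears in exactly one validation fold, so every sequence is used for validation once and for training four times.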

Discussion
In this study, we developed a knowledge enhanced CPT+ (KCPT+) model which considers the previously known onset probability of couple, triple, and quadruple sequences to further improve overall prediction ability, while preserving the advantage of CPT+ of capturing subsequence similarities without information loss. The main findings of this study are summarized as follows:
1) the median onset ages of complications in diabetic patients were identified: ophthalmological complications occur earliest, followed by neurological and renal complications.
2) the commonest couple, triple, quadruple, quintuple, and sextuple sequence patterns in diabetic patients were identified. An easy-to-understand graphical representation of the sequential complication patterns is presented to identify typical progression trajectories of diabetes-related complications.
3) network analyses were conducted to extract meaningful comorbidity connection properties, identifying meaningful clusters of comorbidities that tend to occur together.
4) an accurate sequence prediction model was developed for predicting the next possible complication (or mortality) given any prior sequence. The proposed KCPT+ model outperformed other models including CPT, CNN, and LSTM. The sequence prediction model can help clinicians to devise effective treatment strategies for diabetes-related complications before they develop.
Sequential pattern analysis has been applied to aid decision making for changing the treatment dose of insulin in type 1 diabetics [28] and to predict the next prescribed medication for diabetes [29]. In terms of trajectory analysis of disease patterns in diabetes, a study from Korea demonstrated the progression trajectories 1) retinopathy -> polyneuropathy -> peripheral vascular disease, and 2) depressive episode -> musculoskeletal disorders -> thyroid disorders [9]. By contrast, a study from Denmark found a total of 1,171 significant trajectories. These authors grouped the trajectories into patterns centred on key diagnoses such as chronic obstructive pulmonary disorder and gout, which they found to be central for disease progression [10]. In our study, we focused specifically on the trajectory patterns of diabetes-related complications, revealing several important trajectory sequences of up to six sequential complications.
Compact prediction tree plus (CPT+) [24] has been proposed as a fairly new probabilistic predictive model to assist sequential pattern analysis. The fundamental advantage of CPT+ is that it compresses training sequences without information loss by exploiting similarities between subsequences, and that it operates with low time complexity.
Traditional sequence prediction models make the Markovian assumption that each event depends solely on the immediately preceding events. This may lead to reduced prediction accuracy [34], because these traditional models are built using only part of the information contained in training sequences (Markov models typically consider only the last k items of training sequences to perform a prediction, where k is the order of the model). However, increasing the order of Markov models often induces a very high state complexity, thus making the model impractical for real applications [35]. Considering the complete information contained in training sequences (sequential patterns that do not depend only on the most recent events) is expected to improve overall sequence prediction performance. CPT+ considers subsequence similarity information to improve prediction accuracy with low time complexity.
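The limitation of low-order Markov models mentioned above can be illustrated with a small order-k predictor. The function names and toy sequences are assumptions for this example. In the toy data, an order-1 model sees only the last complication and therefore ignores the earlier history, whereas an order-2 model uses it, at the cost of a context table that grows with every additional item of history.

```python
from collections import Counter, defaultdict

def train_markov(sequences, k):
    """Count which item follows each context of k consecutive items."""
    table = defaultdict(Counter)
    for seq in sequences:
        for i in range(k, len(seq)):
            table[tuple(seq[i - k:i])][seq[i]] += 1
    return table

def predict_markov(table, history, k):
    """Most frequent successor of the last k items, or None if unseen."""
    context = tuple(history[-k:])
    if context not in table:
        return None
    return table[context].most_common(1)[0][0]

# Hypothetical toy sequences, for illustration only.
toy = (
    [["dementia", "renal", "mortality"]] * 2
    + [["neurological", "renal", "heart failure"]] * 3
)
history = ["dementia", "renal"]
order1 = predict_markov(train_markov(toy, 1), history, 1)
order2 = predict_markov(train_markov(toy, 2), history, 2)
print(order1, order2)
```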

Limitations
Several limitations of this study should be noted. Firstly, as this was an administrative database study, -coding and coding error is a possibility. Secondly, given the retrospective nature of this study, missing data may lead to information bias.

Conclusion
This study analyses the sequential pattern characteristics of disease-related complications that adversely affect patients' quality of life. The identified couple, triple, quadruple, quintuple, and sextuple sequence patterns improve the understanding of the pathology of complication development. The proposed complication sequence prediction model can be implemented as a core module of a medical decision support system for better risk-stratified care, to enable early complication detection and prevention.

Conflicts of Interests
None.

Funding
None.

References
ingerland AS, Herman WH, Redekop WK, Dijkstra RF, Jukema JW, Niessen LW (2013) Stratified Patient-Centered Care in Type 2 Diabetes. A cluster randomized, controlled clinical trial of effectiveness and cost-effectiveness 36(10)

medRxiv preprint doi: https://doi.org/10.1101/2020.12.21.20248646; this version posted December 22, 2020.
[26] Bengio S, Vinyals O, Jaitly N, Shazeer N (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In: Advances in Neural Information Processing Systems, pp 1171-117
Sextuple sequence | Patients number
Neurological -> Ophthalmological -> Renal -> IHD with AMI -> Heart failure -> Mortality | 5
Atrial fibrillation -> IHD without AMI -> Neurological -> Ophthalmological -> Renal -> Mortality | 3
IHD without AMI -> Ophthalmological -> Renal -> PVD -> Atrial fibrillation -> Mortality | 3
Neurological -> Ophthalmological -> Renal -> IHD without AMI -> PVD -> Mortality | 3
Ophthalmological -> Renal -> Neurological -> IHD with AMI -> Heart failure -> Mortality | 3
Ophthalmological -> Renal -> Neurological -> Ischemic stroke -> Heart failure -> Mortality | 3
PVD -> Neurological -> Ophthalmological -> Renal -> IHD without AMI -> Mortality | 3
PVD -> Neurological -> Renal -> IHD with AMI -> Heart failure -> Mortality | 3
Atrial fibrillation -> AMI without known IHD -> Neurological -> Ophthalmological -> Renal -> Mortality | 2
Atrial fibrillation -> Neurological -> Ophthalmological -> Renal -> IHD without AMI -> Ischemic stroke | 2