Introduction

Evidence suggests that the symptoms associated with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection are highly variable and heterogeneous [1]. Because of this heterogeneity, prediction models for COVID-19 (e.g., models for the general population to predict the risk of hospital admission, or models to support the prognosis of patients with COVID-19) have quickly entered the literature to support medical decision-making; however, almost all published prediction models are poorly described [2].

Machine learning methods such as clustering algorithms have been increasingly used to investigate the heterogeneity of COVID-19, identifying different clinical phenotypes with similar combinations of traits. Clustering is an unsupervised learning method, meaning that no “a priori” hypotheses about patients’ prognosis need to be injected by clinicians; therefore, results are data-driven and not biased by previously proposed groupings [3]. In fact, some studies have identified clusters of symptoms associated with SARS-CoV-2 infection [4] or with in-hospital mortality [5].

There is also increasing evidence supporting the presence of post-COVID symptoms, i.e., symptoms persisting after the acute phase of infection. The prevalence of post-COVID symptoms ranges from 35 to 60% depending on the symptom and the follow-up period [6]. Identification of factors associated with the development of post-COVID symptomatology is needed for early monitoring of patients at higher risk, yet the number of studies is still limited [7]. Potential risk factors described in the literature include female sex, high symptom load, age, longer hospital stay, and a high number of comorbidities; however, these findings are based on studies including samples of < 500 patients recruited from single centers; therefore, no definitive conclusions can be drawn [6, 7]. In fact, contradictory results are consistently found in the literature in relation to these risk factors [6, 7].

Some attempts have been made to identify clusters of patients according to post-COVID symptoms. Huang et al. [8], in a preprint study, identified some clusters of symptoms at the acute phase of infection associated with being a long-hauler, but no subgroups of patients were identified. In another preprint, Estiri et al. [9] identified different phenotypes suggesting that the presence of anosmia, dysgeusia, chest pain, or chronic fatigue was an indicator of past SARS-CoV-2 infection in the preceding 6 months in young women. Similarly, Ziauddeen et al. [10] identified that post-COVID symptoms broadly clustered into two groups, a majority cluster (88.8%) mostly exhibiting cardiopulmonary symptoms, and a second minority cluster (11.2%) showing more multisystem symptoms. Davis et al. [11] were able to identify three clusters according to the time course of post-COVID symptoms: one group presenting symptoms that are most likely to occur early at the acute phase of infection (2 weeks), another group presenting symptoms highly stable over time, and a third group with symptoms most likely to increase sharply in the first months after the infection. Previous studies have analyzed clusters of symptoms separately, that is, COVID-19 associated symptoms at the acute phase or post-COVID symptoms [8,9,10,11]. The present study aimed to identify clusters (groups) of COVID-19 survivors exhibiting long-term post-COVID symptoms based on clinical/hospitalization data by using cluster analysis.

Methods

This multicenter study (LONG-COVID-EXP-CM) included patients hospitalized with a positive diagnosis of SARS-CoV-2 by RT-PCR technique and radiological findings during the first wave of the pandemic (March 10th–May 31st, 2020) in five public hospitals in Madrid (Spain). From the total of all hospitalized patients during that period, a sample of 400 individuals from each hospital was randomly selected. The Local Ethics Committees of all hospitals approved the study design (HCSC20/495E, HSO25112020, HUFA20/126, HUIL 092-20, HUF/EC1517). Informed consent was obtained from subjects before collecting data.

Clinical features (i.e., age, gender, height, weight, medical comorbidities), symptoms at hospital admission, and hospitalization data (e.g., days at the hospital, intensive care unit [ICU] admission) were collected from hospital records. Participants were scheduled for a telephone interview conducted by trained healthcare professionals. A predefined list of post-COVID symptoms including fatigue, dyspnea, throat pain, cough, palpitations, anosmia, ageusia, voice problems, hair loss, skin rashes, brain fog, memory loss, musculoskeletal pain, anxiety, depressive symptoms, and sleep or gastrointestinal problems was systematically asked, although participants were free to report any post-COVID symptom that they suffered from. The Hospital Anxiety and Depression Scale (HADS) and the Pittsburgh Sleep Quality Index (PSQI) were used to evaluate anxiety/depressive symptoms and sleep quality, since both questionnaires can be properly administered by telephone [12]. Both the anxiety (HADS-A, 0–21 points) and depressive (HADS-D, 0–21 points) subscales were used. We considered the cut-off scores recommended for the Spanish population (HADS-A ≥ 12 points; HADS-D ≥ 10 points) suggestive of anxiety and depressive symptoms, respectively [13]. The PSQI score (0–21 points) evaluated the quality of sleep during the past month, where high scores (score > 8.0 points) indicate poor sleep quality [14].

Due to the similarities between Severe Acute Respiratory Syndrome (SARS) and COVID-19, we used the Functional Impairment Checklist (FIC), a disease-specific tool for evaluating the functional consequences of SARS [15]. The FIC includes four items assessing physical symptoms (shortness of breath, i.e., dyspnea, at rest or on exertion, fatigue, and muscle weakness) and four other items assessing limitations in occupational, leisure/social, basic, and instrumental activities of daily living as a result of the symptoms [15]. In this study, we analyzed the presence of each item on an individual basis, and we also calculated the FIC-symptoms and FIC-disability scores from the severity of each item, graded on four levels (0, no; 1, mild; 2, moderate; 3, severe).
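As an illustration of the grading scheme above, the following minimal Python sketch computes the two FIC subscale scores for one participant. It assumes that each subscale score is the plain sum of its four item grades (range 0–12); the text does not state the aggregation rule, so this is an assumption, and the item keys and the function name `fic_scores` are hypothetical labels introduced only for this example.

```python
# Severity grades as described in the text: 0 = no, 1 = mild, 2 = moderate, 3 = severe.
SEVERITY = {"no": 0, "mild": 1, "moderate": 2, "severe": 3}

# Hypothetical item keys for the four symptom items and four disability items.
SYMPTOM_ITEMS = ("dyspnea_at_rest", "dyspnea_on_exertion", "fatigue", "muscle_weakness")
DISABILITY_ITEMS = ("occupational", "leisure_social", "basic_adl", "instrumental_adl")


def fic_scores(responses):
    """Return (FIC-symptoms, FIC-disability) for one participant.

    Assumption: each subscale is the sum of its four item grades.
    """
    symptoms = sum(SEVERITY[responses[item]] for item in SYMPTOM_ITEMS)
    disability = sum(SEVERITY[responses[item]] for item in DISABILITY_ITEMS)
    return symptoms, disability


# Made-up responses for a single participant (not study data).
example = {
    "dyspnea_at_rest": "mild",
    "dyspnea_on_exertion": "moderate",
    "fatigue": "severe",
    "muscle_weakness": "no",
    "occupational": "no",
    "leisure_social": "mild",
    "basic_adl": "no",
    "instrumental_adl": "no",
}
print(fic_scores(example))  # (6, 1)
```

The presence of an individual item, as analyzed in the study, corresponds here to any grade above 0.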

Clustering analysis

Clustering techniques attempt to find subgroups (i.e., clusters) of patients that are similar among themselves but different from the rest. The simplest and most commonly used clustering algorithm is k-means. Given a number of clusters K, it starts by randomly placing K centroids (i.e., prototype patients) and assigning every patient to the closest centroid (in terms of the Euclidean distance); then, each centroid is recomputed as the mean of all patients assigned to it; this process is repeated until convergence. Instead of random initialization, the k-means++ method [16] was employed. The results for 2, 3, and 4 clusters were assessed, and, although all three tests yielded similar results, only the 3-cluster model is presented here for the sake of brevity. The Python library scikit-learn was employed to perform the clustering [17].
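The procedure described above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the study's actual pipeline: the feature standardization step, the `n_init` and `random_state` values, and the helper name `cluster_patients` are assumptions that the text does not specify.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_patients(features, n_clusters=3, seed=0):
    """Run k-means with k-means++ initialization; return labels and inertia."""
    # Assumption: features are standardized so that no variable dominates
    # the Euclidean distance because of its scale.
    scaled = StandardScaler().fit_transform(features)
    model = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                   random_state=seed)
    labels = model.fit_predict(scaled)
    return labels, model.inertia_


# Synthetic stand-in for the patient feature matrix (3 groups, 4 features).
rng = np.random.default_rng(0)
synthetic = np.vstack(
    [rng.normal(loc=center, scale=0.5, size=(50, 4)) for center in (0.0, 3.0, 6.0)]
)

# Assess K = 2, 3, and 4, as in the text.
results = {k: cluster_patients(synthetic, n_clusters=k) for k in (2, 3, 4)}
for k, (labels, inertia) in results.items():
    print(k, np.bincount(labels), round(inertia, 1))
```

The inertia (within-cluster sum of squared distances) is printed only as a convenient way to compare the candidate values of K; the text does not say which criterion was used for that comparison.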

Statistical analysis of the clusters

After applying the clustering algorithm, the mean and standard deviation were computed for each feature in each cluster, and a one-way ANOVA test (Holm–Bonferroni-corrected for multiple comparisons) was employed to find which variables had a statistically different mean value between at least two of the clusters. ANOVA was calculated using the Python library SciPy (ver. 1.6.2) [18], and the p-values were corrected with the Python library statsmodels (ver. 0.12.1) [19].
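A minimal sketch of this per-feature comparison, using `scipy.stats.f_oneway` and `statsmodels.stats.multitest.multipletests` with `method="holm"`, is given below. The synthetic data, the significance level of 0.05, and the helper name `compare_clusters` are assumptions made for illustration only.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multitest import multipletests


def compare_clusters(features, labels, alpha=0.05):
    """One-way ANOVA per feature across clusters, with Holm-Bonferroni correction."""
    cluster_ids = np.unique(labels)
    p_values = []
    for j in range(features.shape[1]):
        # One sample of values per cluster for feature j.
        groups = [features[labels == c, j] for c in cluster_ids]
        p_values.append(f_oneway(*groups).pvalue)
    reject, p_corrected, _, _ = multipletests(p_values, alpha=alpha, method="holm")
    return reject, p_corrected


# Synthetic example: feature 0 differs between clusters, feature 1 is noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 100)
features = np.column_stack([
    rng.normal(loc=labels * 2.0, scale=1.0),
    rng.normal(loc=0.0, scale=1.0, size=labels.size),
])

# Per-cluster mean and standard deviation of each feature.
for c in np.unique(labels):
    subset = features[labels == c]
    print(c, subset.mean(axis=0).round(2), subset.std(axis=0, ddof=1).round(2))

reject, p_corrected = compare_clusters(features, labels)
print(reject, p_corrected)
```

A significant ANOVA result only indicates that at least two cluster means differ for that feature; identifying which pairs differ would require an additional post hoc comparison.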

Results

A total of 2,000 participants were randomly selected from the involved hospitals and invited to participate. Nine refused to participate, five could not be contacted after three attempts, and 17 had died after hospital discharge. A total of 1,969 patients (mean age: 61, SD: 16 years, 46.4% women) were included. Participants were assessed a mean of 8.4 months (SD 1.5, range 6 to 10 months) after hospital discharge.

Three clusters with different distributions in the variables were clearly identified, as visualized in Fig. 1 (clinical and hospitalization data) and Fig. 2 (post-COVID data). Table 1 summarizes clinical and hospitalization data, whereas Table 2 shows post-COVID symptoms (including mood disorders) and functional limitations for every cluster. Table 1 shows that one cluster (number 2) grouped patients with a lower prevalence of medical comorbidities and a lower number of COVID-19 symptoms at hospital admission than the other two (clusters 0 and 1), particularly respiratory symptoms such as dyspnea, cough, and throat pain at the acute phase of infection. Cluster 2 also grouped more males and slightly younger patients than the other two. Clusters 0 and 1 were not significantly different in either the prevalence of medical comorbidities or COVID-19 symptoms at hospital admission.

Fig. 1 Plots of the distribution of clinical and hospitalization data for each of the three clusters. Categorical features have been represented as bar plots

Fig. 2 Plots of the distribution of post-COVID data for each of the three clusters. Categorical features have been represented as bar plots

Table 1 Clinical and hospitalization data according to each cluster
Table 2 Post-COVID symptoms and functional limitations according to each cluster

Table 2 reveals that clusters 0 and 1 grouped individuals with a higher number of long-term post-COVID symptoms (particularly respiratory post-COVID symptoms, e.g., fatigue or dyspnea, but also pain symptoms), more limitations in daily living activities (occupational, leisure/social, basic, or instrumental), higher anxiety and depressive levels, and worse sleep quality when compared with cluster 2. Cluster 0 grouped the individuals with the highest number of post-COVID symptoms and the greatest functional limitation, since individuals within cluster 1 exhibited fewer limitations with daily living activities than those in cluster 0. No significant differences in mood disorders and sleep quality were seen between clusters 0 and 1. Cluster 2 grouped the least affected patients, with the lowest number of post-COVID symptoms and almost no limitations with daily living activities.

Discussion

This research shows data from the first study of phenotypic clusters of COVID-19 survivors including COVID-19 associated symptoms at the acute phase of infection (hospital admission), long-term post-COVID symptoms, and functional limitations. We were able to identify three clusters (groups) of COVID-19 survivors: one cluster (2) grouping patients less affected at hospital admission (lower number of pre-existing medical comorbidities and lower number of COVID-19 symptoms at the acute phase), with a smaller number of post-COVID symptoms and no functional limitations; and two clusters (0 and 1) grouping individuals more affected at hospital admission (greater number of pre-existing comorbidities and more COVID-19 symptoms at the acute phase), with a greater number of post-COVID symptoms, more limitations with daily living activities, higher levels of anxiety/depression, and worse sleep quality. Importantly, one cluster grouped those patients with more respiratory post-COVID symptoms and worse functional limitations with daily living activities (cluster 0).

Our study identified three cluster phenotypes in a population of previously hospitalized COVID-19 survivors associating previous medical comorbidities, COVID-19 symptoms at hospital admission, long-term post-COVID symptoms, and functional repercussions. Cluster analysis grouped individuals according to the number of pre-existing medical comorbidities, the number of COVID-19 associated symptoms at the acute phase, and the number of post-COVID symptoms in the same cluster. This clustering would support the assumptions that a higher symptom load (more symptoms) at the acute phase of SARS-CoV-2 infection and a greater number of pre-existing medical comorbidities are associated with a greater likelihood of post-COVID symptoms, particularly fatigue or dyspnea, 3–6 months after infection [7]. Seeßle et al. [20] have recently observed that the number of COVID-19 associated acute symptoms was also correlated with post-COVID symptoms at 12-month follow-up. Additionally, these factors also agree with recent studies suggesting that post-COVID symptoms are more prevalent in COVID-19 patients reporting severe symptoms at onset (higher symptom load) [21] and severe-to-critical illness (those with higher medical comorbidities) at hospital [22]. Nevertheless, it should be considered that contradictory results in relation to the association between COVID-19 associated onset symptoms and post-COVID symptomatology are found in current literature [6, 7]. Additionally, it is also possible that the presence of pre-existing medical comorbidities before the infection could contribute to the development of post-COVID symptoms; however, preliminary evidence suggests that this association is disease-specific, since pre-existing hypertension was associated with a higher number of post-COVID symptoms [23], whereas diabetes [24] and asthma [25] were not.

An important finding was that one subgroup (cluster 0) exhibited more functional limitations and also more psychological stress than the other group (cluster 1) showing a similar number of post-COVID symptoms. Interestingly, the most affected cluster grouped individuals (72%) reporting dyspnea as a long-term post-COVID symptom, which may explain why these subjects also showed more limitations during daily living activities. In fact, the presence of dyspnea at rest could also be associated with a higher presence of anxiety/depressive symptoms and poor sleep quality due to a continuous sensation of breathlessness. Since these clusters share common characteristics and it can sometimes be difficult to recognize which cluster a patient belongs to, the development of dyspnea at rest as a post-COVID symptom could be used for monitoring this subgroup of patients.

Similarly, our analysis also revealed a greater proportion of females in those clusters showing more long-term post-COVID symptoms (clusters 0–1). The topic of female gender as a risk factor for developing post-COVID symptoms is controversial since it is supported by some studies [26, 27] but not by others [28, 29]. Our cluster analysis grouped a higher proportion of females in those groups exhibiting more post-COVID symptoms, supporting that female gender might be a risk factor for long-term COVID symptomatology.

We believe that identification of different clusters may be of great help to clinicians to identify those cases at a higher risk of developing better or worse long-term conditions, thus directing more individualized therapeutic strategies. For instance, previous studies have associated COVID-19 onset symptoms at the acute phase with worse prognosis and higher in-hospital mortality [4, 5]. Our cluster analysis suggests that early identification of patients with a higher symptom load (a greater number of symptoms) at onset could lead to a more individualized symptomatic treatment at hospital admission. Similarly, identification of risk factors associated with the development of dyspnea as a post-COVID symptom could also improve the prognosis of individuals within the most affected group of COVID-19 survivors.

Although, to the best of our knowledge, this is the largest multicenter study investigating a classification system including COVID-19 associated onset symptoms and post-COVID symptoms using cluster analysis, some limitations should be recognized. First, we included hospitalized COVID-19 survivors; therefore, these data should not be extrapolated to non-hospitalized patients. Second, we included only Caucasian participants; therefore, current findings should not be extrapolated to other ethnicities. Third, we collected post-COVID symptoms systematically from a predefined list; the first validated and reliable instrument for monitoring symptoms and impact of post-COVID symptoms (Long COVID Symptom and Impact Tools) was only recently developed [30]. Finally, defining a true phenotype requires similar clinical and physiological characteristics, underlying pathobiology with identifiable biomarkers, and predictable responses to therapy. Accordingly, it would be necessary to phenotype each of the identified clusters for a better understanding of their differences.

Conclusions

The application of cluster analysis has identified three clusters of previously hospitalized COVID-19 survivors: one group of patients with a lower number of medical comorbidities, a lower number of COVID-19 symptoms at the acute phase, a lower number of post-COVID symptoms, and no functional limitations; and two groups of patients with a greater number of medical comorbidities, more COVID-19 symptoms at the acute phase, a greater number of post-COVID symptoms, and more limitations with daily living activities. This subgrouping may reflect different mechanisms, which should be considered in therapeutic interventions.