Abstract
Background Studies on Type 2 Diabetes Mellitus (T2DM) have revealed heterogeneous sub-populations in terms of underlying pathologies. However, identification of subpopulations in epidemiological datasets remain unexplored. We here focus on the detection of T2DM clusters in epidemiological data, specifically analysing the National Family Health Survey-4 (NFHS-4) dataset containing a wide spectrum of features, including medical history, dietary and addiction habits, socio-economic and lifestyle patterns of 10,125 T2DM patients.
Methods Epidemiological data provide challenges for analysis due to the diverse types of features in it. In this case, applying the state-of-the-art dimension reduction tool UMAP conventionally was found to be ineffective for the NFHS-4 dataset, which contains continuous, ordinal and nominal feature types. Continuous features, although smaller in numbers, had an overpowering effect on the distribution of clusters. We implemented a distributed clustering workflow combining different similarity measure settings of UMAP, for clustering continuous, ordinal and nominal features separately. We integrated the reduced dimensions from each feature-type-distributed clustering to obtain interpretable and unbiased clustering of the data.
Findings From a methodological perspective, we show that for diverse data types, frequent in epidemiological datasets, feature-type-distributed clustering using UMAP is effective as opposed to the conventional use of the UMAP algorithm. Application of UMAP based clustering workflow for this type of dataset is novel in itself.
Our analysis reveals four significant clusters, with two of them comprising mainly of non-obese T2DM patients. These non-obese clusters has lower mean age and majorly comprises of rural residents. Surprisingly, one of the obese clusters had 90% of the T2DM patients practising non-vegetarian diet though they did not show an increased intake of plant-based protein-rich foods.
Interpretation Our findings demonstrate the presence of a heterogeneity among T2DM patients with regard to socio-demography and dietary pattern. From our analysis, we conclude that, existence of significant non-obese T2DM subpopulations characterized by younger age group and economic disadvantage, raise the need of different screening criteria for T2DM among rural Indian residents.
Funding This work was in part supported by funds from Bioinformatics Infrastructure (de.NBI) and Establishment of Systems Medicine Consortium in Germany e:Med, as well as the German Federal Ministry for Education and Research (BMBF) programs (FKZ 01ZX1709C). The work has also been funded and supported by the Indian Council of Medical research (ICMR) (No.3/1/3/JRF-2017/HRD-LS/56429/54).
1 Introduction
Type 2 Diabetes Mellitus (T2DM) is a multifactorial disease globally estimated to rise to 629 million cases by 2045 (See IDF Diabetes Atlas) [1, 2]. Though conceived as a homogeneous disease for long, several recent studies have found T2DM to be a mix of heterogenous disease subtypes [3, 4, 5]. These studies have reported a varied pathophysiology underlying T2DM and thereby suggest the possibility of a personalised treatment for T2DM.
Besides obesity, other factors like age, sex, socio-economic status, place of residence (rural/urban), smoking habit, alcohol intake, food frequency etc. significantly associate with T2DM [6, 7, 8, 9, 10, 11, 12, 13]. Several of these factors are modifiable in nature and hence are important in the management of T2DM [1]. However, modification of lifestyle-related factors vary and thereby lead to a differential degree of glycemic control among T2DM patients [14]. Glycaemic control and response to anti-diabetics has also been shown to be different among T2DM sub-groups [15]. To explore whether any particular pattern of patient sub-populations exist within the entire T2DM population based on socio-demographic and lifestyle factors, we used an unsupervised clustering approach on the largest and most comprehensive epidemiological dataset in India, the National Family Health Survey-4 (NFHS-4) dataset. Clusters were subsequently characterised to identify unique socio-demographic and lifestyle patterns associated with these sub-populations.
Epidemiological datasets provide a comprehensive set of information regarding socio-demography, lifestyle, addiction and co-morbidities. Variables containing such information are called features in the language of Machine Learning. In the T2DM-NFHS-4 dataset, there are 36 such features, containing information on each diabetes patient. Moreover, in our dataset, the features can be categorised into three types:
Continuous features: These are the features which can assume any numeric value from a continuous range. For example, BMI of a patient is a continuous feature.
Ordinal features: These are the features which assume values from a discrete range, such that, there is a sense of order in the values assumed by the feature. For example, let us assume a feature ‘meat consumption by a patient’, assumes values ‘daily’, ‘weekly’ or ‘monthly’. Clearly the range of the feature ‘meat consumption by a patient’ is discrete, since it can assume any one of the three values. Also, there is a sense of order in the values, indicating that daily meat consumption is the highest and daily meat consumption is the lowest, if we want to quantify meat consumption.
Nominal features: These are the features which assume values from a discrete range, such that, there is no sense of order in the values assumed by the feature. For example, let us assume a feature ‘Religion of a patient’, assumes values ‘Hindus’, ‘Muslims’ or ‘Christians’. Clearly the range of the feature ‘meat consumption by a patient’ is discrete, since it can assume any one of the three values. But there is no sense of order in the possible values assumed by the features. Yet, this feature draws its importance from the fact that lifestyle patterns or diets vary largely among these religious groups.
Such diverse types of features in epidemiological data create challenges for the analysis. Conventional application of the state-of-the-art dimension reduction tool Uniform Manifold Approximation (UMAP) was found to be ineffective for the T2DM-NFHS-4 dataset,. Continuous features, although smaller in numbers, had a overpowering effect on the distribution of clusters. To address this problem, we implemented a distributed clustering workflow, combining different similarity measure settings of UMAP, for clustering continuous, ordinal and nominal features separately. We integrated the reduced dimensions from each feature-type-distributed clustering to obtain interpretable and unbiased clustering of the data.
The workflow realised for the present study (Figure 1) involves investigation of underlying socio-demographic patterns within patient sub-populations using unsupervised learning. Dimension reduction approaches are often used to reduce higher dimensional data to lower dimensions such that in the lower dimensional embedding of the data one can visualize underlying clusters within the data, that are not apparent in the higher dimensions [16]. Several such techniques have been developed over the last few decades. Until recently the dimension reduction technique t-Stochastic Neighbourhood Embedding (t-SNE) was a state-of the-art algorithm in this field providing numerous applications in various fields [17, 18, 19]. t-SNE projects high dimensional data to a lower dimension while maintaining the underlying local manifold structure in a sense that, in a lower dimension t-SNE can cluster points, that are close enough in the latent high dimensional manifold [17].
With a rigorous mathematical foundation, considerably high speed and easy to use using scikitlearn API, UMAP has turned out to be one of the most popular choices among the data scientists [20, 21, 22]. As opposed to t-SNE, UMAP uses a graph based manifold approximation mechanism which contributes to preservation of the global as well as Social properties of the latent data manifold in a lower dimensional representation of the data. Given some low dimensional representation of the data, a similar process can be used to construct an equivalent topological representation. UMAP builds a graph considering customized neighbourhoods for every data points. This graph is a representation of the higher dimensional data manifold. The end result is a patchwork of low-dimensional representations of neighbourhoods that groups similar data points on a local scale while better preserving long-range topological connections to more distantly related data points [20, 22]. For the ability of UMAP to preserve the long-range topological connections along with the short-range topological connections and because of its high computational efficiency we choose UMAP for our unsupervised clustering approach. Moreover, UMAP allows an user to specify several similarity measures through the tuning of the metric parameter. This has been critical in our workflow, since our data contains continuous and categorical features and choosing suitable similarity measures for continuous and categorical features is crucial for a meaningful and informative clustering [23].
2 Methodology
2.1 Source and Description of the T2DM NFHS-4 Dataset
Data preparation and pre-processing are the key aspects of approaching a problem from a Machine Learning perspective. In this Section we provide the details on the pre-processing approach adopted to generate the T2DM-NFHS-4 dataset.
The NFHS-4 dataset was downloaded from The Demographic & Health Surveys (DHS) Program website. NFHS-4 is the fourth version of national health survey conducted under the supervision of Ministry of Health and Family Welfare, Government of India with the International Institute for Population Sciences (IIPS), Mumbai serving as the main nodal agency for all the surveys. The sampling procedure followed in NFHS-4 was of stratified two-stage sampling covering all the 640 districts of India. The survey was successfully conducted with 601,509 households. In those interviewed households 112,122 men and 699,686 women could be successfully interviewed. Four survey questionnaires (Household Questionnaire, Woman’s Questionnaire, Man’s Questionnaire and Biomarker Questionnaire) were implemented in 17 local languages to collect information on basic demographic information, socio-economic parameters, family planning issues, nutritional status, health indicators, contact with community health workers etc. Uniqueness of the NFHS-4 study was that it collected data on Diabetes status and performed a Random Blood Glucose for individuals (15-54 years) using a finger-stick blood specimen. As a result, the biomarker measurements and tests besides anthropometric measurements like anaemia testing, blood pressure measurement, blood glucose testing and HIV testing were included in the survey.
2.2 Dataset Preparation
For dataset preparation and cleaning, the three questionnaires were merged-Woman’s Questionnaire, Man’s Questionnaire and Biomarker Questionnaire. The first two contained information about background characteristics (location, age, sex, religion, social group, literacy, wealth status etc), nutritional practices, addictions and co-morbidities while the bio-marker questionnaire contained information on height, weight, blood pressure and random blood glucose. A unique code was generated for all individuals in all the three questionnaires by appending the Country code and phase, Cluster number, Household number and Line number. The three datasets were joined by the unique code to prepare a single dataset of 810,971 individuals consisting of all men and women between 15-54 years of age. Pregnant women were next excluded to discard the possibility of Gestational Diabetes Mellitus. Individuals with missing diabetic and blood pressure status were also excluded. Variables known to be risk factors for DM (BMI, Age, Place of residence, Wealth Index, Smoking frequency, Alcohol intake frequency, Hypertension), socio-economic factors (Sex, Religion, Social group, Educational status), Dietary frequencies and haemoglobin level were selected for final analysis. BMI, age and haemoglobin level were taken as continuous variables and the rest as categorical variables. Outliers were removed separately for all the three continuous variables to obtain the final dataset with 610498 individuals (526678 females and 83820 males).
2.3 Dataset Preprocessing
We were interested in detecting significant T2DM sub-populations in the data and further sought to characterize these subpopulations based on the socio-demographic and co-morbid conditions. For this purpose, we extracted patients with known history of diabetes from the dataset: a total of 10,125 patients. We considered a diverse collection of socio-demographic and co-morbid conditions as ‘features’ in our dataset. Qualitatively our features can be divided into several categories:
Co-morbid conditions: This class of features considers the co-morbid diseases among T2DM patients. We considered whether a T2DM patient had medical conditions such as Asthma, Thyroid disorder, Heart disease, Cancer, Tuberculosis and Hypertension. Thus, there were six features in this category. These features are binary in nature denoting whether a T2DM patient suffered from a given comorbidity or not.
Food habits: This class of features considered the food habits of T2DM patients. The features considered here were how frequently the patient took the food items: Milk or Curd, Pulses or Beans, Dark leafy vegetables, Fruits, Eggs, Fish, Chicken, Fried food and Aerated drinks. Thus, there were nine features in this category. Features were categorical and ordinal in nature having four possible values: ‘Daily’, ‘Occasionally’, ‘Weekly’ and ‘Never’.
Addiction history: This class of features considered the addiction pattern of T2DM patients. There were two features in this class, both binary in nature encoding whether a patient is a Smoker or whether a patient takes Alcohol.
Socio-demographic features: These included features such as Sex, Age, Wealth index, Education level, Religion and Caste along with Body Mass Index (BMI) and Haemoglobin level of the patient. There were eight features in this category.
Living conditions: This class of features quantify the living conditions of the patients. The features in this class considered whether a patient lives in a household possessing refrigerator, bicycle, motorbike, four wheeler vehicle and livestock. Moreover, there were features denoting type of residence, household structure, frequency of household members smoking inside the house, type of cooking fuel used, source of drinking water and time to reach the nearest drinking water source. Thus, there were eleven features belonging to this category.
For our study, 36 features or factors are considered to investigate significant patient populations among the diabetes patients into consideration. Note that there are both continuous and categorical features among these thirty six features. Among the categorical features there are both ordinal features and nominal features. Ordinal features have a sense of order among them, such as the features from the ‘food habits’ category as described before. The nominal features are categorical features with no sense of order such as sex of a patient. Note that for our dataset the continuous features are: Age, BMI, Haemoglobin level and Time to get to drinking water source; whereas the nominal features are: Sex, Religion, Caste, Household structure, Type of place of residence, Type of cooking fuel and Source of drinking water. The rest of the features are ordinal features. The categorization of features into continuous, nominal and ordinal is of utmost importance in our clustering paradigm which we discuss in Section 2.4.1.
2.4 Identification of T2DM sub-populations using U-MAP and DBSCAN
From our detailed description of our dataset we pointed out that our dataset has a variety of features including continuous and categorical features. Further, there are both ordinal and nominal features among the categorical features in our dataset. A simple UMAP on the entire dataset is depicted in Figure 2(a), revealing two broad clusters. For this clustering UMAP parameters n_neighbours have been chosen to be 30, whereas the metric parameter has been chosen to be euclidean. However we have a number of important nominal and ordinal categorical features whose effect would not be apparent from such a clustering. Moreover, the euclidean distance does not always make sense on categorical features, especially if they are nominal in nature. For example, observe Figure 2(d), where we have used UMAP considering only the nominal features with metric parameter hamming (based on hamming distance). This reveals a completely different picture of the dataset, showing several small clusters. Our clustering paradigm is designed to optimise this effect and find a balance in the clustering where a particular type of feature does not have an overpowering effect on the clustering process.
2.4.1 Clustering paradigm using UMAP
Our clustering paradigm applies UMAP separately on continuous, nominal and ordinal features separately. For each of these feature categories we create a lower dimensional embedding of the dataset. Finally we integrate the lower dimensional embeddings to extract clusters from them using the DBSCAN algorithm, a clustering algorithm used for extracting clusters from data based on data density. One advantage of this algorithm is that one does not need to specify the number of clusters from beforehand. DBSCAN considers closely or densely located points, as clusters [24]. For UMAP, we use the same values for the parameters n_neighbours= 30 and min_distance= 0.1 for all the feature types.
For the continuous features we use the metric measure to be Euclidean. The Euclidean distance between two vectors is given by:
For the nominal features we use themetric measure to be Hamming. Hamming distance is defined as: where δ(xi, yi) = 1 if xi = yi and δ(xi, yi) = 0 otherwise. Recall that, nominal features are also a type of categorical features which do not have a sense of order associated to them. For such features Hamming distance is widely used as a similarity measure between data points [23].
For the ordinal features we use the metric measure to be Canberra. It is a weighted version of the Manhattan measure. The Canberra distance is given by:
Ordinal features are also a type of categorical features. However, the Hamming metric can not capture the inherent ordered relationships and statistic information from categorical values [23]. We thus tried using UMAP for several metric measures and noticed that the Canberra distance measure retains a high variance in the lower dimensions. Thus we chose the Canberra distance measure as a similarity metric for ordinal features.
For the categorical and ordinal features we thus produce a two dimensional representation of each data point by taking into consideration the first two UMAP coordinates. For the nominal features we consider we produce a one dimensional representation, since the data points are too scattered in this case as shown in Figure 2(d) and thus can lead to too many clusters. Thus, we reduce every data point into a five dimension representation, two for each of the continuous and ordinal features and one for the nominal features. Finally, we look for clusters in the five dimensional representation using DBSCAN (eps= 1, minpoints= 200). After selecting the final clusters, we characterized them by summarizing all the 36 variables separately for each cluster. The continuous variables were summarized as their mean and the standard error of the mean. The categorical variables were summarized as their frequency distribution and the proportion of each value within each cluster.
2.4.2 Extraction of T2DM sub-populations using DBSCAN
Using our clustering paradigm described before, we can detect seven subpopulations among the patients where 261 patients are considered as outliers. We show the distribution of clusters in Figure 3a. We further perform a UMAP on the five dimensional reduced representation of our data to visualize the clusters detected by DBSCAN. For this we label the data points using the DBSCAN clustering labels and colour code them in the UMAP representation of the five dimensional reduced data as shown in Figure 3b. This provides validation to the fact the clustering done by DBSCAN makes sense. Note that, from our clusters we can detect four significant patient subpopulations containing 2898, 2301, 2226 and 1315 data points.
3 Results
3.1 Characterization of clusters
Age and BMI both were found to be lower in Cluster 2 and Cluster 4
Age and obesity are the most important risk factors for T2DM. However, we found a heterogeneity in both these variables across all the clusters. Interestingly, the mean Age and BMI both were lower in Cluster 2
(Age: 38.3 ± 0.19 years, BMI: 23.9 ± 0.1) and Cluster 4 (Age: 37.9 ± 0.26 years, BMI: 23.6 ± 0.13) compared to Cluster 1 (Age: 41.3 ± 0.14 years, BMI: 26.7 ± 0.09) and Cluster 3 (Age: 39.9 ± 0.18 years, BMI: 26 ± 0.11). However distribution of males and females has been found to be similar across all the clusters.
Higher proportion of rural residents and lower proportion of richest wealth quintile in Cluster 2 and 4
Proportion of rural residents was found to be high in Cluster 2 (69.4% were Rural residents) and Cluster 4 (72.02% were Rural residents) compared to the other clusters (31.3% in
Cluster 1 and 49.19% in Cluster 3). Surprisingly, only 4.3% people in Cluster 2 and 8.37% in Cluster 4 belonged to the richest quintile of the Wealth Index category whereas 64.04% in Cluster 1 and 54.9% in Cluster 3 belonged to the same.
Frequency of co-morbid conditions were similar across all the clusters
Co-morbid conditions included history of asthma, thyroid disease, heart disease, cancer, history of tuberculosis, haemoglobin level and hypertension. Though the distribution of disease conditions show minor variation across the clusters (Table 1), the trend is almost similar in all the clusters.
Lifestyle patterns show evidences of a lower quality of life for patient sub-populations in Cluster 2 and 4
Our analysis reveal several other factors that support the fact that T2DM sub-populations from Cluster 2 and Cluster 4 have a considerably lower quality of life.
We observe that only 0.22% and 24.79% of patients belonging to Cluster 2 and Cluster 4 respectively possess a refrigerator compared to 95.48% and 65.77% of patients belonging to Cluster 1 and Cluster 3 respectively.
Only 30.9% and 32.78% of patients belonging to Cluster 2 and Cluster 4 respectively possess a motorbike compared to 71.53% and 67.03% of patients belonging to Cluster 1 and Cluster 3 respectively.
Only 3.26% and 3.19% of patients belonging to Cluster 2 and Cluster 4 respectively possess a car/truck compared to 23.5% and 17.34% of patients belonging to Cluster 1 and Cluster 3 respectively.
44.24% and 54.98% of patients belonging to Cluster 2 and Cluster 4 respectively, use plant based cooking fuel, which is relatively cheap, compared to 12.22% and 19.63% of patients belonging to Cluster 1 and Cluster 3 respectively. Moreover, only 41.94% and 36.2% of patients belonging to Cluster 2 and Cluster 4 respectively use Gas/Oil based cooking fuel, which is relatively expensive, compared to 84.89% and 70.17% of patients belonging to Cluster 1 and Cluster 3 respectively.
6.35 % and 15.51% of patients belonging to Cluster 2 and Cluster 4 respectively, drink water from unprotected sources, compared to 2.62% and 1.98% of patients belonging to Cluster 1 and Cluster 3 respectively.
Intake of non-vegetarian foods is invariably low in Cluster 3
Around 90% of the population in Cluster 3 had no intake of Egg (89.08%), fish (97.12%), chicken or meat (97.71%) whereas only less than 10% of the population in all the other 3 clusters had no intake of these non-vegetarian foods (Table 1). Though the Cluster 3 population had the highest daily intake of milk/curd (61.81%) and pulses/beans (50.31%) compared to the other clusters, other clusters also had almost similar proportion of people taking milk/curd and pulses/beans daily. Intake of other foods like dark leafy vegetables, fruits, fried foods and aerated drinks showed similar distribution across all the clusters.
4 Discussion
4.1 Rationale of the workflow in clustering epidemiological data
The clustering workflow used arises from some important observations that we will discuss here. To begin with we have a population of 10,125 T2DM patients with a diverse ensemble of features accounting for information on medical history, dietary and addiction habits, socio-economic and lifestyle patterns. Moreover, the features in the considered dataset are also diverse in terms of data types. We have a total of 36 features, out of which 4 are continuous features, 7 nominal features and 25 ordinal features, all of equal importance by assumption.
The aim is to find significant sub-populations in our data such that the identified sub-populations are interpretable in terms of the considered features. Note here that, by significant subpopulations we mean a subpopulation consisting of at least 10 percent of the total population. If there exists such sub-populations and we can explain the subpopulations in terms of the considered features, we can argue that these patterns exist in significant number of patients.
We have already argued in favour of using UMAP for our unsupervised approach to find clusters in the data. However, we observed that applying UMAP algorithm conventionally using the euclidean similarity metric on our entire dataset with 36 features turns out to be ineffective. The reason is, in this case the continuous features have an overpowering effect over the other feature types in determining the distribution of clusters. This can be observed from Figure 2(a) and 2(b). Note that Figure 2(a) shows UMAP clustering with all 36 features and 2(b) shows UMAP clustering with only four continuous features. Note that, there is a similarity in the clustering distribution of these figures, each containing one major cluster and seven small minor clusters. We observed that this is because of the fact that UMAP, when applied on all 36 features of the dataset using euclidean similarity measure is largely biased towards finding similarity among data points only in terms of the continuous features. Given that we have only four continuous features out of 36, this poses a problem as the diverse information present in the dataset in the form of the ordinal and nominal features are largely ignored.
To solve this problem, the clustering of continuous, ordinal and nominal features were treated separately by using different similarity matrices for them, giving rise to our clustering paradigm. We argued on our choice of similarity measures in Section 2.4.1. This generates for each feature type a data representation of lower dimension shown in Figure 2(b-d). We finally integrated these lower dimension data representations by taking two dimensional representations for continuous and ordinal features and an one dimensional representation (the one consisting of the most variance) for nominal features. The reason behind considering one dimensional representation for nominal features, is that using Hamming metrics for such data results in retaining a lot of variance in the data resulting in multiple clusters as we observe in Figure 2(d). Considering a two dimensional representation for this data while integrating these lower dimension data representations carry forward this variance and result in multiple small clusters in the final clustering distribution, which contradicts our aim of finding significantly large sub-populations (of at least 10 percent of the total population).
Finally, the integration is done by applying UMAP on the five dimensional reduced representation of the dataset using euclidean similarity measure (shown in Figure 3b). Note here that, in our final clusters we can observe patterns in all of continuous, ordinal and nominal data types. For example, in Cluster 4 the continuous feature ‘Time to Water source (min)’ shows very high values compared to other clusters. In Cluster 1 and 3, the nominal feature ‘Cooking fuel used’ shows a higher percentage for Gas/Oil users while in Cluster 2 and 4 the same feature shows a higher percentage for plant-based fuel users. In Cluster 3, the ordinal feature ‘Fish intake frequency’ shows a 97 percent of people to be never consuming fish. Thus, we infer that our clustering paradigm enables us to find significant sub-populations while keeping the clustering distribution unbiased, that is no feature type continuous, ordinal and nominal has an overpowering effect on the other.
4.2 Significance of T2DM clusters
T2DM was identified as a homogeneous disease with Insulin Resistance followed by β-cell dysfunction being the underlying pathology. However recent studies have explored and found T2DM to be a heterogeneous entity with the relative contribution of Insulin Resistance and β-cell dysfunction to differ across T2DM clusters [3]. These studies were performed on clinical and biochemical data with variables having uniform data types. On the other hand, our clustering approach takes into account the diverse data types obtained from an epidemiological dataset and discovers clusters among the T2DM population. Interestingly, two of the four clusters obtained in our study belonged to the non-obese T2DM phenotype where the mean BMI was below 25. These two non-obese clusters also had lower mean age compared to the other clusters. Both these non-obese clusters had larger proportion of rural residents and lower proportion of people belonging to the highest wealth quintile concluding to the fact that a large majority of T2DM people from rural India have lower BMI and are younger in age. The T2DM patient subpopulation belonging to these clusters have a relatively lower quality of life judging by analysis the lifestyle pattern based features. The non-obese phenotype of T2DM has been increasingly reported over the last two decades raising concern about the uniqueness of its underlying pathophysiology with a greater contribution of β-cell dysfunction compared to Insulin Resistance [25, 26, 27, 28]. This non-obese T2DM phenotype has been found among Asians and studies depicting and investigating its similarities and differences has been in place. Studies have concluded T2DM to occur among the Asians at a lower BMI cut-off and also at a younger age [29, 30]. This finding of two non-obese clusters with lower mean age provides confirmation to this.
Though non-obese T2DM is being considered as a unique phenotype, epidemiological studies for identifying high-risk population groups still remain undone. This is especially important for many Asian countries where over half of the T2DM population is of non-obese phenotype [25]. This analysis, reporting an increased presence of Rural residents in both the non-obese T2DM clusters, calls for a modification in BMI and Age cut-off for T2DM screening among rural residents. However identification of risk factors for T2DM specific to the rural population needs to be done. Representation of people from the highest wealth quintile was much lower in both the non-obese T2DM clusters. T2DM is a multi-factorial disease requiring strict compliance to lifestyle modification, proper diet and anti-diabetic therapy. Non-obese T2DM clusters with reduced representation from the highest wealth quintile suggests the possibility of an unequal access to care for non-obese T2DM people thereby generating the need of a more equitable healthcare policy in terms of prevention and therapy.
On the other hand, both the obese T2DM clusters had higher age and more urban residents. The proportion of people from the highest wealth quintile was higher in both the obese clusters. Interestingly one of the obese clusters (Cluster 3) had invariably low intake of non-vegetarian foods (egg, fish, chicken and meat) pointing out to the fact this T2DM cluster comprised of non-vegetarian people mainly. Dietary requirements in diagnosed T2DM patients involves reduced amount of carbohydrates and fats with increased amount of protein-rich foods [31]. Animal products, being rich sources of dietary protein, need to be included in the diet. One of the obese T2DM clusters with a strict non-vegetarian dietary pattern suggests the need to design a proper dietary guidelines for this group.
5 Conclusion
From a data science perspective, this analysis addresses the issue of diverse data types. We have shown that for such data conventional application of dimension reduction approaches might not be fruitful. We develop a workflow that contributes to finding meaningful and interpretable clusters such that the distribution of clusters is not biased by the data types.
Existence of a significant non-obese T2DM patient sub-population belonging to younger age group and having larger proportions of rural residents raises with a lower quality of life, indicate the need of a different screening criteria for T2DM among rural Indian residents. The obese T2DM cluster with around 90% of people sticking to the non-vegetarian diet calls for the need of dietary guidelines for T2DM patients having a non-vegetarian dietary pattern.
Data Availability
We support the idea of transparency and reproducibility of research. Therefore, all data relevant to this work are made publicly available on a GitHub repository.
https://github.com/Saptarshi-Bej/Type-2-Diabetes-Mellitus-T2DM-/blob/master/Preprocessed_DM_xx.zip
Data availability
We support the idea of transparency and reproducibility of research. Therefore, all data relevant to this work are made publicly available on the GitHub repository https://github.com/Saptarshi-Bej/Type-2-Diabetes-Mellitus-T2DM-/blob/master/Preprocessed_DM_xx.zip. More-over, the python code (in form of a jupyer notebook) for the implementation of our workflow is also provided publicly in https://github.com/Saptarshi-Bej/Type-2-Diabetes-Mellitus-T2DM-/blob/master/Clustering_paradigm_disc_cont.ipynb.
Author Contributions
Saptarshi Bej and Jit Sarkar are the first authors and contributed equally to this work. Saptarshi Bej, Jit Sarkar, Pabitra Mitra, Partha Chakrabarti and Olaf Wolkenhauer contributed to the study concept and design. Saptarshi Bej, Jit Sarkar and Saikat Biswas did the data analysis. Saptarshi Bej, Jit Sarkar and Olaf Wolkenhauer wrote the manuscript and are the guarantors of this work having full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. All authors approved the final version of the article, including the authorship list.
Disclosure Summary
The authors declare no conflict of interest.
Acknowledgements
This work was in part supported by funds from Bioinformatics Infrastructure (de.NBI) and Establishment of Systems Medicine Consortium in Germany e:Med, as well as the German Federal Ministry for Education and Research (BMBF) programs (FKZ 01ZX1709C). JS received a research fellowship from Indian Council of Medical research (ICMR) (No.3/1/3/JRF-2017/HRD-LS/56429/54).
Footnotes
We added some more interpretations on our results