Ensemble methods for classification of patients for personalized medicine with high-dimensional data
Introduction
Providing guidance on specific therapies for pathologically distinct tumor types to maximize efficacy and minimize toxicity is important for cancer treatment [1], [2]. Clinically heterogeneous diffuse large B-cell lymphoma (DLBCL) has two molecularly distinct forms: germinal centre B-like DLBCL and activated B-like DLBCL. Patients with germinal centre B-like DLBCL have significantly better overall survival than those with activated B-like DLBCL [3]. Consequently, they may require less aggressive chemotherapy. For tumors of the lung, the pathological distinction between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) can be difficult. Early MPM is best treated with extrapleural pneumonectomy followed by chemoradiation, whereas ADCA is treated with chemotherapy alone [4]. Thus, accurate classification of tumor samples and the right treatment for each distinct tumor type are essential for efficient cancer treatment and prolonged survival in a target population of patients.
Microarray technology has been increasingly used in cancer research because of its potential for classification of tissue samples based only on gene expression data [5], [6]. Microarrays are simply ordered sets of DNA molecules of known sequence. With DNA microarray technology, one can simultaneously measure expression profiles for thousands of genes in tissue samples. Much research involving microarray data analysis is focused on distinguishing between different cancer types using gene expression profiles from disease samples, thereby allowing more accurate diagnosis and effective treatment of each patient.
Gene expression data might also be used to improve disease prognosis in order to prevent some patients from having to undergo painful unsuccessful therapies and unnecessary toxicity. For example, adjuvant chemotherapy for breast cancer after surgery could reduce the risk of distant metastases; however, 70–80% of patients receiving this treatment would be expected to survive metastasis-free without it [7]. Gene expression profiles of sporadic breast cancers could be used to predict metastases better than clinical and histopathological prognostic factors. The strongest predictor variables for metastases, such as lymph node status and histological grade, fail to accurately classify breast tumors according to their clinical behavior [7], [8].
Classification algorithms can be used to process high-dimensional genomic data to distinguish disease subtypes and to predict response to therapy in order to help individualize clinical assignment of treatment. Class prediction is a supervised learning method where the algorithm learns from a training set (known samples) and establishes a prediction rule to classify a test set (new samples). Development of a class prediction algorithm generally consists of selecting features and fitting a prediction model to develop the classification rule using training samples. Some classification algorithms, such as the classification tree or stepwise logistic regression, perform these two steps simultaneously. Sensitivity (SN), specificity (SP) and accuracy, as well as positive predictive value (PPV) and negative predictive value (NPV), are the primary criteria used to evaluate the performance of a classification algorithm. The SN is the proportion of actual positives that are correctly classified as positive. The SP is the proportion of actual negatives that are correctly classified as negative. The accuracy is the proportion of correct classifications out of the total number of samples. The PPV is the probability that a patient is positive given a positive prediction. Its complement, 1 − PPV, is the false discovery rate (FDR). The NPV is the probability that a patient is negative given a negative prediction. Algorithms with high SN and high SP as well as high PPV and high NPV, which will also have high accuracy, are desirable.
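The five criteria above can be computed directly from the counts of a two-class confusion matrix. The following is a minimal sketch; the function name `classification_metrics` and the toy labels are illustrative assumptions and are not taken from the paper.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return SN, SP, accuracy, PPV and NPV for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # A ratio with an empty denominator is undefined; report None.
        return num / den if den else None

    return {
        "SN": ratio(tp, tp + fn),            # correct positives / actual positives
        "SP": ratio(tn, tn + fp),            # correct negatives / actual negatives
        "accuracy": ratio(tp + tn, len(pairs)),
        "PPV": ratio(tp, tp + fp),           # 1 - PPV is the false discovery rate
        "NPV": ratio(tn, tn + fn),
    }


# Toy labels for illustration only (not data from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

With these toy labels there are 3 true positives, 1 false negative, 4 true negatives and 2 false positives, so SN = 3/4, SP = 4/6, accuracy = 7/10, PPV = 3/5 and NPV = 4/5.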
Much research using ensembles of classifiers has been conducted in order to improve classification performance by using different subsets of inputs to train the classifiers. Cherkauer [9] introduced a machine learning system that combines multiple artificial neural networks by simple unweighted averaging to improve classification performance. Tumer and Ghosh [10] addressed methods for reducing the correlations among the individual classifiers by using several re-partitioning schemes of the training sets for training different classifiers. Chen et al. [11] introduced methods for combining multiple classifiers with different features randomly extracted from sample data.
Recently, an ensemble-based classification algorithm, Classification by Ensembles from Random Partitions (CERP), has been developed for high-dimensional data [12]. An ensemble of classifiers can form a superior classifier even though individual classifiers might be somewhat weak and error-prone in making decisions [13]. Moreover, an ensemble of ensembles can further enhance class prediction [12]. For example, MultiBoosting, a technique that uses multiple ensembles, has been introduced and offers a further advantage over AdaBoost [14].
Recently, three ensemble voting approaches, boosting [15], [16], bagging [17], and random subspace [18], have received attention. There are major differences among boosting, bagging, random subspace and CERP. The same features are used by each classifier in the other methods, while different features are used by each classifier in CERP. In boosting, each classifier uses the same training samples, with hard-to-classify samples getting more weight by design, while bagging yields incomplete overlap of training samples among classifiers, with some samples getting more weight randomly. The random subspace method combines multiple classification trees constructed in randomly selected subspaces. Boosting, bagging and random subspace tend to cause dependence among classifiers, while the classifiers are expected to be less correlated in a CERP ensemble.
In this paper, we propose Classification-Tree CERP (C-T CERP), an ensemble of ensembles of optimal numbers of pruned classification trees based on the Classification and Regression Trees (CART) [19] algorithm. As in Logistic Regression Tree CERP (LR-T CERP [12]), we derive the optimal number of classifiers from an adaptive binary search algorithm using cross-validation (CV). Individual classifiers in an ensemble are constructed from randomly partitioned, mutually exclusive subsets of the entire feature space. Our adaptive binary search method can be generalized to any classification algorithm to find an optimal number of classifiers in an ensemble.
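The core idea of building one classifier per mutually exclusive feature subset and combining them by voting can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: a one-split decision stump (`fit_stump`, a hypothetical helper) stands in for a pruned CART tree, the number of subsets is fixed by hand rather than chosen by the adaptive binary search, and a single ensemble is built rather than an ensemble of ensembles.

```python
import random
from collections import Counter


def random_partition(n_features, n_subsets, rng):
    """Randomly split feature indices into mutually exclusive subsets."""
    indices = list(range(n_features))
    rng.shuffle(indices)
    return [indices[i::n_subsets] for i in range(n_subsets)]


def fit_stump(X, y, features):
    """Stand-in base learner: the single threshold split, restricted to
    `features`, with the fewest training errors (labels are 0 or 1)."""
    best = None
    for j in features:
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            for left, right in ((0, 1), (1, 0)):
                errors = sum(
                    (left if row[j] <= thr else right) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, thr, left, right)
    if best is None:
        # No usable split (all features constant): predict the majority class.
        majority = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, row):
    j, thr, left, right = stump
    return left if row[j] <= thr else right


def fit_ensemble(X, y, n_subsets, rng):
    """Fit one base classifier per feature subset of a random partition."""
    subsets = random_partition(len(X[0]), n_subsets, rng)
    return [fit_stump(X, y, subset) for subset in subsets]


def predict_ensemble(stumps, row):
    """Combine the base classifiers by unweighted majority vote."""
    votes = Counter(predict_stump(s, row) for s in stumps)
    return votes.most_common(1)[0][0]


# Toy data for illustration only: 4 samples, 6 features, labels 0/1.
X = [
    [0.1, 0.2, 0.0, 0.3, 0.1, 0.2],
    [0.2, 0.1, 0.3, 0.0, 0.2, 0.1],
    [0.9, 0.8, 1.0, 0.7, 0.9, 0.8],
    [0.8, 0.9, 0.7, 1.0, 0.8, 0.9],
]
y = [0, 0, 1, 1]

subsets = random_partition(6, 3, random.Random(0))
stumps = fit_ensemble(X, y, 3, random.Random(1))
predictions = [predict_ensemble(stumps, row) for row in X]
```

An odd number of subsets is used here so that a two-class majority vote cannot tie. Repeating `fit_ensemble` with different random partitions and pooling the votes would give a rough analogue of an ensemble of ensembles.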
The performance of C-T CERP is compared to other well-known classification algorithms: Random Forest (RF) [20], Boosting [15], [21], [22], Decision Forest (DF) [23], Support Vector Machine (SVM) [24], Diagonal Linear Discriminant Analysis (DLDA) [6], Shrunken Centroids (SC) [25], CART, Classification Rule with Unbiased Interaction Selection and Estimation (CRUISE) [26], and Quick, Unbiased and Efficient Statistical Tree (QUEST) [27].
C-T CERP utilizes a group of optimal trees from totally randomized parameter spaces based on mutually exclusive subsets of the entire feature space. On the other hand, RF takes bootstrap samples for each tree and randomly selects predictor variables from the entire feature space at each node. Boosting is a general method for reducing the error of any learning algorithm by a weighted majority vote of the outputs of the weak classifiers. AdaBoost [21] fits an additive model in a base learner by optimizing an exponential loss function. Similarly, LogitBoost [22] fits additive logistic regression models by taking the binomial log-likelihood as a loss function. DF uses an averaging scheme to build an ensemble of trees (not necessarily optimal) from mutually exclusive subsets of the available entire parameter space in a sequential manner. C-T CERP, RF, Boosting and DF are ensemble classifiers. SVM is a kernel-based machine learning approach, which exploits information about the inner products in some feature space. DLDA is a classification rule based on a linear discriminant function. DLDA assumes the same diagonal variance–covariance matrix for all the classes. SC is based on an enhancement of the simple nearest centroid classifier. CART, CRUISE and QUEST are single optimal trees in the sense of minimizing misclassification error and tree complexity. Among these single-tree algorithms, CART and QUEST yield binary trees whereas CRUISE yields multiway splits.
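The two boosting loss functions mentioned above can be written out explicitly. The notation below (labels coded as $y \in \{-1, +1\}$, additive model $F$ with weak learners $f_m$ and weights $c_m$) is an assumption borrowed from the standard additive-model formulation of boosting and is not taken from this paper.

```latex
% Additive model built from M weak learners f_m with weights c_m
F(x) = \sum_{m=1}^{M} c_m f_m(x), \qquad \hat{y}(x) = \operatorname{sign} F(x)

% AdaBoost: minimize the exponential loss
J_{\mathrm{exp}}(F) = \mathbb{E}\left[ e^{-y F(x)} \right]

% LogitBoost: maximize the binomial log-likelihood, i.e. minimize
J_{\mathrm{logit}}(F) = \mathbb{E}\left[ \log\left( 1 + e^{-2 y F(x)} \right) \right],
\qquad p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}
```

Both criteria penalize negative margins $yF(x)$, but the log-likelihood grows only linearly for badly misclassified samples, whereas the exponential loss grows exponentially.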
Ensemble learning methods have been applied to high-dimensional data sets [28], [29], [30]. Tan and Gilbert [28] applied ensemble learning (bagged and boosted decision trees) to gene expression data for cancer classification. Long and Vega [29] improved the adaptive boosting algorithm and applied it to several microarray data sets. Chen et al. [30] reported classification ensembles for unbalanced class sizes in predictive toxicology and applied them to high-dimensional data sets.
The proposed algorithm is applied to three published data sets relevant to personalized medicine. The algorithm is first used for the prediction of lymphoma subtypes based on gene-expression in B-cell malignancies among DLBCL patients [3]. Similarly, it is employed on gene-expression data to distinguish MPM from ADCA of the lung in order to identify the treatment that would result in the best possible outcome [4]. Our algorithm is then used to predict which breast cancer patients would benefit from adjuvant chemotherapy after surgery based on gene-expression data [7]. The performance of the classification algorithm is assessed by twenty replications of 10-fold CV.
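The evaluation scheme of twenty replications of 10-fold CV can be sketched as an index generator. This is a minimal illustration, assuming plain (unstratified) random folds; the function name `repeated_kfold` and the sample size used below are hypothetical and do not come from the paper.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=20, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Each repeat reshuffles the samples and splits them into n_folds
    mutually exclusive test folds; the remaining samples form the
    training set for that fold.
    """
    rng = random.Random(seed)
    for r in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::n_folds] for i in range(n_folds)]
        for f, test_idx in enumerate(folds):
            train_idx = [
                i for g, fold in enumerate(folds) if g != f for i in fold
            ]
            yield r, f, train_idx, test_idx


# Hypothetical sample size, for illustration only.
splits = list(repeated_kfold(n_samples=47))
```

A classifier would be refit on `train_idx` and scored on `test_idx` for each of the 200 splits, and the performance measures would then be averaged over the replications.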
Methods
The classification problem is to predict the class label Y, based on the gene expression profile X, by constructing a classifier C using a training set such that the misclassification risk, P(C(X) ≠ Y), is as small as possible. When the dimension of gene expression profile m is much smaller than the sample size n, a Bayes classifier or logistic regression can be employed for such a problem. However, it is a well-understood phenomenon that a prediction model built from thousands of available
Results
This section presents the performance (accuracy, SN, SP, PPV, NPV) of C-T CERP along with various other well-known classification algorithms on three published high-dimensional microarray data sets for personalized medicine.
The R package randomForest is used for the RF algorithm. The number of trees is set to the default, ntree = 500 [20]. The number of features selected at each node in a tree is chosen using the default value of floor(m^(1/2)) [20], where m is the total number
Conclusion and discussion
Recent advancements in biotechnology have accelerated research on the development of molecular biomarkers for the diagnosis and treatment of disease. The Food and Drug Administration envisions clinical tests to identify patients most likely to benefit from particular drugs and patients most likely to experience adverse reactions [41]. Such patient profiling will enable assignment of drug therapies on a scientifically sound predictive basis rather than on an empirical trial-and-error basis.
Acknowledgements
Hongshik Ahn's research was partially supported by the Faculty Research Participation Program at the NCTR administered by the Oak Ridge Institute for Science and Education through an interagency agreement between USDOE and USFDA. The authors would like to thank Dr. T. Lee for downloading and manipulating the breast cancer data set.
References (41)
- et al. Classification by ensembles from random partitions of high-dimensional data. Comput Stat Data Anal (2007)
- et al. A decision-theoretic generalization of online learning and an application to boosting. J Comput Syst Sci (1997)
- et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999)
- et al. Recursive partitioning for tumor classification with gene expression microarray data. Proc Natl Acad Sci USA (2001)
- et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature (2000)
- et al. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res (2002)
- et al. Class discovery and classification of tumor samples using mixture modeling of gene expression data—a unified approach. Bioinformatics (2004)
- et al. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc (2002)
- et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature (2002)
- Breast cancer prognostic factors: evaluation guidelines. J Natl Cancer Inst (1991)
- Human expert-level performance on a scientific image analysis task by a system using combined artificial neural networks
- Error correlation and error reduction in ensemble classifier. Connect Sci
- Methods of combining multiple classifiers with different features and their applications to text-independent speaker identification. Int J Pattern Recogn Artif Intell
- The elements of statistical learning: data mining, inference, and prediction
- MultiBoosting: a technique for combining boosting and wagging. Mach Learn
- The strength of weak learnability. Mach Learn
- Experiments with a new boosting algorithm
- Bagging predictors. Mach Learn
- The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell
- Classification and regression trees