Ensemble methods for classification of patients for personalized medicine with high-dimensional data

https://doi.org/10.1016/j.artmed.2007.07.003

Summary

Objective

Personalized medicine uses the genomic signatures of patients in a target population to assign more effective therapies, improve diagnosis, and enable earlier interventions that might prevent or delay disease. The objective is to find a novel classification algorithm that can predict response to therapy and thereby help individualize the clinical assignment of treatment.

Methods and materials

Classification algorithms must be highly accurate to support optimal treatment of each patient. Typically, there are numerous genomic and clinical variables but relatively few patients, which makes it hard for most traditional classification algorithms to avoid over-fitting the data. We developed a robust classification algorithm for high-dimensional data based on ensembles of classifiers built from the optimal number of random partitions of the feature space. The software is available on request from the authors.

Results

The proposed algorithm is applied to genomic data sets on lymphoma patients and lung cancer patients to distinguish disease subtypes for optimal treatment and to genomic data on breast cancer patients to identify patients most likely to benefit from adjuvant chemotherapy after surgery. The performance of the proposed algorithm is consistently ranked highly compared to the other classification algorithms.

Conclusion

The statistical classification method for individualized treatment of diseases developed in this study is expected to play a critical role in developing safer and more effective therapies that replace one-size-fits-all drugs with treatments that focus on specific patient needs.

Introduction

Providing guidance on specific therapies for pathologically distinct tumor types to maximize efficacy and minimize toxicity is important for cancer treatment [1], [2]. Clinically heterogeneous diffuse large B-cell lymphoma (DLBCL) has two molecularly distinct forms: germinal centre B-like DLBCL and activated B-like DLBCL. Patients with germinal centre B-like DLBCL have significantly better overall survival than those with activated B-like DLBCL [3]. Consequently, they may require less aggressive chemotherapy. For tumors of the lung, the pathological distinction between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) can be difficult. Early MPM is best treated with extrapleural pneumonectomy followed by chemoradiation, whereas ADCA is treated with chemotherapy alone [4]. Thus, accurate classification of tumor samples and the right treatment for each distinct tumor type are essential for efficient cancer treatment and prolonged survival in a target population of patients.

Microarray technology has been increasingly used in cancer research because of its potential for classification of tissue samples based only on gene expression data [5], [6]. Microarrays are simply ordered sets of DNA molecules of known sequence. With DNA microarray technology, one can simultaneously measure expression profiles for thousands of genes in tissue samples. Much research involving microarray data analysis is focused on distinguishing between different cancer types using gene expression profiles from disease samples, thereby allowing more accurate diagnosis and effective treatment of each patient.

Gene expression data might also be used to improve disease prognosis in order to prevent some patients from having to undergo painful unsuccessful therapies and unnecessary toxicity. For example, adjuvant chemotherapy for breast cancer after surgery could reduce the risk of distant metastases; however, 70–80% of patients receiving this treatment would be expected to survive metastasis-free without it [7]. Gene expression profiles of sporadic breast cancers could be used to predict metastases better than clinical and histopathological prognostic factors. The strongest predictor variables for metastases, such as lymph node status and histological grade, fail to accurately classify breast tumors according to their clinical behavior [7], [8].

Classification algorithms can be used to process high-dimensional genomic data to distinguish disease subtypes and to predict response to therapy in order to help individualize clinical assignment of treatment. Class prediction is a supervised learning method where the algorithm learns from a training set (known samples) and establishes a prediction rule to classify a test set (new samples). Development of a class prediction algorithm generally consists of selecting features and fitting a prediction model to develop the classification rule using training samples. Some classification algorithms, such as the classification tree or stepwise logistic regression, perform these simultaneously. Sensitivity (SN), specificity (SP) and accuracy, as well as positive predictive value (PPV) and negative predictive value (NPV), are the primary criteria used to evaluate the performance of a classification algorithm. The SN is the proportion of correct positive classifications out of the number of true positives. The SP is the proportion of correct negative classifications out of the number of true negatives. The accuracy is the proportion of correct classifications out of the total number of samples. The PPV is the probability that a patient is positive given a positive prediction. Its complement, 1 − PPV, is the false discovery rate (FDR). The NPV is the probability that a patient is negative given a negative prediction. Algorithms with high SN and high SP as well as high PPV and high NPV, which will have high accuracy, are obviously desirable.
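The five criteria defined above can be computed directly from the counts of true and false positives and negatives. The following is a minimal Python sketch, not code from the paper; the function name `classification_metrics` and the toy labels are illustrative assumptions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return SN, SP, accuracy, PPV and NPV for binary class labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # A ratio with an empty denominator is undefined; report it as None.
        return num / den if den else None

    return {
        "SN": ratio(tp, tp + fn),        # correct positives / true positives
        "SP": ratio(tn, tn + fp),        # correct negatives / true negatives
        "accuracy": ratio(tp + tn, len(pairs)),
        "PPV": ratio(tp, tp + fp),       # 1 - PPV is the false discovery rate
        "NPV": ratio(tn, tn + fn),
    }


# Toy labels (made up for illustration): 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

With unbalanced class sizes, accuracy alone can look high while SN or PPV is poor, which is why all five values are reported together.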

Much research using ensembles of classifiers has been conducted in order to improve classification performance by using different subsets of inputs to train the classifiers. Cherkauer [9] introduced a machine learning system that combines multiple artificial neural networks by simple unweighted averaging to improve classification performances. Tumer and Ghosh [10] addressed methods for reducing the correlations among the individual classifiers by using several re-partitioning schemes of the training sets for training different classifiers. Chen et al. [11] introduced methods for combining multiple classifiers with different features randomly extracted from sample data.

Recently, an ensemble-based classification algorithm, Classification by Ensembles from Random Partitions (CERP), has been developed for high-dimensional data [12]. An ensemble of classifiers can form a superior classifier even though individual classifiers might be somewhat weak and error-prone in making decisions [13]. Moreover, an ensemble of ensembles can further enhance class prediction [12]. For example, a technique using multiple ensembles, called MultiBoosting, has been introduced, and it offers a further advantage over AdaBoost [14].

Recently, three ensemble voting approaches, boosting [15], [16], bagging [17], and random subspace [18], have received attention. There are major differences among boosting, bagging, random subspace and CERP. The same features are used by each classifier in the other methods, while different features are used by each classifier in CERP. In boosting, each classifier uses the same training samples, with hard-to-classify samples getting more weight by design, while bagging yields incomplete overlap of training samples among classifiers, with some samples getting more weight randomly. The random subspace method combines multiple classification trees constructed in randomly selected subspaces. Boosting, bagging and random subspace tend to cause dependence among classifiers, while the classifiers in a CERP ensemble are expected to be less correlated.

In this paper, we propose Classification-Tree CERP (C-T CERP), an ensemble of ensembles of optimal numbers of pruned classification trees based on the Classification and Regression Trees (CART) [19] algorithm. As in Logistic Regression Tree CERP (LR-T CERP [12]), we derive the optimal number of classifiers from an adaptive binary search algorithm using cross-validation (CV). Individual classifiers in an ensemble are constructed from randomly partitioned, mutually exclusive subsets of the entire feature space. Our adaptive binary search method can be generalized to any classification algorithm to find an optimal number of classifiers in an ensemble.
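The random partitioning step can be illustrated with a short Python sketch. It is not the authors' software: the helper names `random_partition` and `majority_vote` are hypothetical, the number of subsets is fixed by hand instead of being chosen by the adaptive binary search, and a plain majority vote is assumed for combining the base classifiers.

```python
import random
from collections import Counter


def random_partition(n_features, n_subsets, seed=None):
    """Split feature indices 0..n_features-1 into mutually exclusive subsets."""
    rng = random.Random(seed)
    indices = list(range(n_features))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin so subset sizes differ by at most one.
    return [sorted(indices[i::n_subsets]) for i in range(n_subsets)]


def majority_vote(predictions):
    """Combine one predicted label per base classifier into a single label.

    Ties are broken in favour of the label seen first (an arbitrary choice).
    """
    return Counter(predictions).most_common(1)[0][0]


# Each subset would be handed to its own base classifier (a pruned tree in
# C-T CERP); here we only show the partition and the vote.
subsets = random_partition(n_features=10, n_subsets=3, seed=0)
vote = majority_vote([1, 0, 1])
```

Because every feature lands in exactly one subset, no two base classifiers in the same ensemble share a predictor, which is the property the text credits for the lower correlation among classifiers.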

The performance of C-T CERP is compared to other well-known classification algorithms: Random Forest (RF) [20], Boosting [15], [21], [22], Decision Forest (DF) [23], Support Vector Machine (SVM) [24], Diagonal Linear Discriminant Analysis (DLDA) [6], Shrunken Centroids (SC) [25], CART, Classification Rule with Unbiased Interaction Selection and Estimation (CRUISE) [26], and Quick, Unbiased and Efficient Statistical Tree (QUEST) [27].

C-T CERP utilizes a group of optimal trees from totally randomized parameter spaces based on mutually exclusive subsets of the entire feature space. On the other hand, RF takes bootstrap samples for each tree and randomly selects predictor variables from the entire feature space at each node. Boosting is a general method for reducing the error of any learning algorithm by a weighted majority vote of the outputs of the weak classifiers. AdaBoost [21] fits an additive model in a base learner by optimizing an exponential loss function. Similarly, LogitBoost [22] fits additive logistic regression models by taking the binomial log-likelihood as a loss function. DF uses an averaging scheme to build an ensemble of trees (not necessarily optimal) from mutually exclusive subsets of the available entire parameter space in a sequential manner. C-T CERP, RF, Boosting and DF are ensemble classifiers. SVM is a kernel-based machine learning approach, which exploits information about the inner products in some feature space. DLDA is a classification rule based on a linear discriminant function. DLDA assumes the same diagonal variance–covariance matrix for all the classes. SC is based on an enhancement of the simple nearest centroid classifier. CART, CRUISE and QUEST are single optimal trees in the sense of minimizing misclassification error and tree complexity. Among these single-tree algorithms, CART and QUEST yield binary trees whereas CRUISE yields multiway splits.
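For reference, the loss functions and the DLDA rule mentioned above are commonly written as follows. The notation is an assumption introduced here and is not taken from the paper: class labels are coded as y ∈ {−1, +1}, F(x) is the additive model, x̄_kj is the mean of feature j in class k, s_j² is the pooled variance of feature j, and equal class priors are assumed for DLDA.

```latex
% AdaBoost: exponential loss
L_{\exp}(y, F) = \exp\bigl(-y\,F(x)\bigr), \qquad y \in \{-1, +1\}

% LogitBoost: negative binomial log-likelihood
L_{\log}(y, F) = \log\bigl(1 + \exp(-2\,y\,F(x))\bigr)

% DLDA: common diagonal covariance matrix, so the discriminant reduces to a
% variance-scaled distance to each class centroid
\hat{C}(x) = \arg\min_{k} \sum_{j=1}^{m} \frac{(x_j - \bar{x}_{kj})^2}{s_j^2}
```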

Ensemble learning methods have been applied to high-dimensional data sets [28], [29], [30]. Tan and Gilbert [28] applied ensemble learning (bagged and boosted decision trees) to gene expression data for cancer classification. Long and Vega [29] improved the adaptive boosting algorithm and applied it to several microarray data sets. Chen et al. [30] reported classification ensembles for unbalanced class sizes in predictive toxicology and applied them to high-dimensional data sets.

The proposed algorithm is applied to three published data sets relevant to personalized medicine. The algorithm is first used for the prediction of lymphoma subtypes based on gene-expression in B-cell malignancies among DLBCL patients [3]. Similarly, it is employed on gene-expression data to distinguish MPM from ADCA of the lung in order to identify the treatment that would result in the best possible outcome [4]. Our algorithm is then used to predict which breast cancer patients would benefit from adjuvant chemotherapy after surgery based on gene-expression data [7]. The performance of the classification algorithm is assessed by twenty replications of 10-fold CV.
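The evaluation protocol, twenty replications of 10-fold CV, can be sketched as an index generator. This is an illustrative Python sketch only: `repeated_kfold` is a hypothetical helper, the folds are not stratified by class (the paper's preview does not say whether they were), and the sample size in the example is arbitrary.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=20, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # fresh random fold assignment for each replication
        folds = [order[i::n_folds] for i in range(n_folds)]
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = sorted(i for i in order if i not in held_out)
            yield repeat, fold, train_idx, sorted(test_idx)


# Arbitrary sample size for illustration: 20 x 10 = 200 train/test splits.
splits = list(repeated_kfold(n_samples=58))
```

Within one replication every sample is held out exactly once; averaging a metric over the twenty replications reduces the dependence of the estimate on any single random fold assignment.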

Section snippets

Methods

The classification problem is to predict the class label Y, based on the gene expression profile X, by constructing a classifier C: X → C(X), using a training set such that the misclassification risk, P(C(X) ≠ Y), is as small as possible. When the dimension of gene expression profile m is much smaller than the sample size n, a Bayes classifier or logistic regression can be employed for such a problem. However, it is a well-understood phenomenon that a prediction model built from thousands of available

Results

This section presents the performance (accuracy, SN, SP, PPV, NPV) of C-T CERP along with other various well-known classification algorithms using three published high-dimensional microarray data sets for personalized medicine.

The R package randomForest is used for the RF algorithm. The number of trees is set to the default, ntree = 500 [20]. The number of features selected at each node in a tree is set to the default value of floor(m^(1/2)) [20], where m is the total number

Conclusion and discussion

Recent advancements in biotechnology have accelerated research on the development of molecular biomarkers for the diagnosis and treatment of disease. The Food and Drug Administration envisions clinical tests to identify patients most likely to benefit from particular drugs and patients most likely to experience adverse reactions [41]. Such patient profiling will enable assignment of drug therapies on a scientifically sound predictive basis rather than on an empirical trial-and-error basis.

Acknowledgements

Hongshik Ahn's research was partially supported by the Faculty Research Participation Program at the NCTR administered by the Oak Ridge Institute for Science and Education through an interagency agreement between USDOE and USFDA. The authors would like to thank Dr. T. Lee for downloading and manipulating the breast cancer data set.

References (41)

  • H. Ahn et al. Classification by ensembles from random partitions of high-dimensional data. Comput Stat Data Anal (2007)
  • Y. Freund et al. A decision-theoretic generalization of online learning and an application to boosting. J Comput Syst Sci (1997)
  • T.R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999)
  • H. Zhang et al. Recursive partitioning for tumor classification with gene expression microarray data. Proc Natl Acad Sci USA (2001)
  • A.A. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature (2000)
  • G.J. Gordon et al. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res (2002)
  • R. Alexandridis et al. Class discovery and classification of tumor samples using mixture modeling of gene expression data - a unified approach. Bioinformatics (2004)
  • S. Dudoit et al. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc (2002)
  • L.J. van’t Veer et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature (2002)
  • W.L. McGuire. Breast cancer prognostic factors: evaluation guidelines. J Natl Cancer Inst (1991)
  • K. Cherkauer. Human expert-level performance on a scientific image analysis task by a system using combined artificial neural networks.
  • K. Tumer et al. Error correlation and error reduction in ensemble classifiers. Connect Sci (1996)
  • K. Chen et al. Methods of combining multiple classifiers with different features and their applications to text-independent speaker identification. Int J Pattern Recogn Artif Intell (1997)
  • T. Hastie et al. The elements of statistical learning: data mining, inference, and prediction. (2001)
  • G.I. Webb. MultiBoosting: a technique for combining boosting and wagging. Mach Learn (2000)
  • R. Schapire. The strength of weak learnability. Mach Learn (1990)
  • Y. Freund et al. Experiments with a new boosting algorithm.
  • L. Breiman. Bagging predictors. Mach Learn (1996)
  • T.K. Ho. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell (1998)
  • L. Breiman et al. Classification and regression trees. (1984)