Summary
Tree-based methods are statistical procedures for automatic learning from data, and their main applications are integrated into data-mining environments for decision support systems. Here we focus on two problems of decision trees: the stability of the rules obtained and their applicability to huge data sets. Because the tree-growing process is highly dependent on the data, i.e. small fluctuations in the data can cause big changes in the tree-growing process, we focus on the stability of the trees themselves. To this end we propose a series of data diagnostics that prevent internal instability in the tree-growing process before a particular split is made. In addition, to be effective in actual managerial problems, tree-based methods must be applicable to massive amounts of stored data with maximum efficiency. For this reason we study the theoretical complexity of such an algorithm. Finally, we present an algorithm that can cope with both problems, whose cost is linear in the number of individuals and which can use a robust impurity measure as a splitting criterion.
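As a rough illustration of what a splitting criterion with a single linear pass over the individuals can look like, here is a minimal sketch. It is not the algorithm presented in the article: the function names `gini` and `best_threshold` are hypothetical, the Gini impurity stands in for the robust impurity measure (which the summary does not specify), and the initial sort costs O(n log n), so only the scan itself is linear.

```python
from collections import Counter


def gini(counts, n):
    """Gini impurity of a node with n cases and the given class counts."""
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values, labels):
    """Best binary split 'x <= threshold' on one numerical predictor.

    Returns (threshold, weighted_impurity); threshold is None when no
    split improves on the unsplit node. Gini is an assumed stand-in for
    the impurity measure, not the one used in the article.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])  # sort once
    n = len(pairs)
    left = Counter()                                # class counts, left child
    right = Counter(label for _, label in pairs)    # class counts, right child
    best = (None, gini(right, n))                   # baseline: no split
    for i in range(n - 1):                          # single linear pass
        x, y = pairs[i]
        left[y] += 1                                # move case i to the left
        right[y] -= 1
        if x == pairs[i + 1][0]:
            continue                                # no cut between equal values
        n_left, n_right = i + 1, n - i - 1
        score = (n_left * gini(left, n_left)
                 + n_right * gini(right, n_right)) / n
        if score < best[1]:
            best = ((x + pairs[i + 1][0]) / 2, score)
    return best
```

With values `[1, 2, 3, 4]`, the labels `['a', 'a', 'b', 'b']` give a cut at 2.5, while relabelling the second case as `'b'` moves the cut to 1.5. That is a small instance of the sensitivity to data fluctuations described above.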
Notes
1 Here we consider only trees made of binary splits. For a general overview of the state of the art in decision tree methodologies and their problems, see Murthy (1998).
2 Misclassification error for a nominal response or quadratic prediction error for a continuous response.
3 The runs were carried out on a Pentium III-600 PC with 511 MB of RAM.
4 CART is a trademark of Salford Systems for its decision tree software.
References
Aluja T., Nafria E. (1996), Automatic segmentation by decision trees, Proceedings on Computational Statistics COMPSTAT 1996, ed. A. Prat, Physica Verlag.
Aluja T., Nafria E. (1998a), Robust impurity measures in decision trees, Data Science, Classification and Related Methods, eds. C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H-H. Bock and Y. Baba, Springer.
Aluja T., Nafria E. (1998b), Generalised impurity measures and data diagnostics in decision trees, Visualising Categorical Data, eds. J. Blasius and M. Greenacre, Academic Press.
Arminger G., Enache D., Bonne T. (1997), Analyzing credit risk data: a comparison of logistic discrimination, classification tree analysis and feedforward networks, Computational Statistics, 12(2), 293–310.
Bravo M.C., Garcia-Santesmases J.M. (2000), Symbolic object description of strata by segmentation trees, Computational Statistics, 15(1), 13–24.
Breiman L., Friedman J.H., Olshen R.A., Stone C.J. (1984), Classification and Regression Trees, Wadsworth International Group, Belmont, California.
Breiman L. (1996a), Technical note: some properties of splitting criteria, Machine Learning, 24, 41–47.
Breiman L. (1996b), Bagging predictors, Machine Learning, 24, 123–140.
Celeux G., Lechevallier Y. (1982), Méthodes de Segmentation non Paramétriques, Revue de Statistique Appliquée, 30(4), 39–53.
Ciampi A. (1991), Generalized Regression Trees, Computational Statistics and Data Analysis, 12, 57–78.
Conversano C., Mola F., Siciliano R. (2001), Partitioning Algorithms and Combined Model Integration for Data Mining, Computational Statistics, 16(3), 323–339.
Greenacre M. (1984), Theory and Application of Correspondence Analysis, Academic Press.
Gueguen A., Nakache J.P. (1988), Méthode de discrimination basée sur la construction d’un arbre de décision binaire, Revue de Statistique Appliquée, 36(1), 19–38.
Hand D.J. (1997), Construction and Assessment of Classification Rules, J. Wiley.
Hofmann H., Unwin A., Wilhelm A. (2001), Data Mining and Statistics. Introduction, Computational Statistics, 16(3), 317–321.
Kass G.V. (1980), An Exploratory Technique for Investigating Large Quantities of Categorical Data, Applied Statistics, 29(2), 119–127.
Mola F., Siciliano R. (1992), A two-stage predictive splitting algorithm in binary segmentation, Computational Statistics, vol. 1, eds. Y. Dodge and J. Whittaker, Physica Verlag.
Murthy S.K. (1998), Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey, Data Mining and Knowledge Discovery, 2, 345–389.
Pleiffer K.P., Pesec B., Mischak R. (1994), Stability of Regression Trees, Proceedings in Computational Statistics COMPSTAT 1994, eds. R. Dutter and W. Grossmann, Physica Verlag.
Quinlan J.R. (1996), Bagging, boosting and C4.5, Proceedings Thirteenth National Conference on Artificial Intelligence, 725–730, eds. W.J. Clancey and D. Weld, AAAI Press.
Sonquist J.A., Morgan J.N. (1964), The Detection of Interaction Effects, Ann Arbor: Institute for Social Research, University of Michigan.
Vach W. (1995), Classification trees, Computational Statistics, 10(1), 9–14.
Cite this article
Aluja-Banet, T., Nafria, E. Stability and scalability in decision trees. Computational Statistics 18, 505–520 (2003). https://doi.org/10.1007/BF03354613
DOI: https://doi.org/10.1007/BF03354613