Stability and scalability in decision trees

Computational Statistics

Summary

Tree-based methods are statistical procedures for automatic learning from data, whose main applications are integrated into data-mining environments for decision support systems. Here, we focus on two problems of decision trees: the stability of the rules obtained and their applicability to huge data sets. Because the tree-growing process is highly data-dependent, i.e. small fluctuations in the data can cause large changes in the tree that is grown, we focus on the stability of the trees themselves. To this end, we propose a series of data diagnostics that prevent internal instability in the tree-growing process before a particular split is made. Moreover, to be effective in real managerial problems, tree-based methods must be applicable to massive amounts of stored data with maximum efficiency. For this reason, we study the theoretical complexity of such an algorithm. Finally, we present an algorithm that can cope with these problems, whose cost is linear in the number of individuals and which can use a robust impurity measure as a splitting criterion.
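As a rough illustration of the kind of split search the summary refers to, the following minimal sketch scans one numeric predictor and scores every binary cut with an impurity measure. It is not the authors' algorithm: the function names are hypothetical, the Gini index stands in for the robust impurity measure (which is not specified here), and the data are assumed to be already sorted on the predictor. Under those assumptions, each candidate cut is scored by updating class counts, so the scan costs time proportional to the number of individuals for a fixed number of classes.

```python
from collections import Counter


def gini(counts, n):
    """Gini impurity of a node holding n cases with the given class counts.

    Assumption: Gini is a stand-in for whatever impurity measure is used.
    """
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_binary_split(xs, ys):
    """Return (threshold, impurity_decrease) for the best cut x <= threshold.

    xs must already be sorted in ascending order, with ys aligned to it.
    Returns (None, 0.0) when no cut reduces the impurity.
    """
    n = len(xs)
    left, right = Counter(), Counter(ys)
    parent = gini(right, n)
    best_threshold, best_gain = None, 0.0
    for i in range(n - 1):
        # Move case i from the right child to the left child.
        left[ys[i]] += 1
        right[ys[i]] -= 1
        if xs[i] == xs[i + 1]:
            continue  # no cut is possible between equal predictor values
        n_left = i + 1
        n_right = n - n_left
        # Size-weighted impurity of the two children.
        child = (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n
        gain = parent - child
        if gain > best_gain:
            best_threshold = (xs[i] + xs[i + 1]) / 2
            best_gain = gain
    return best_threshold, best_gain
```

On the toy data `xs = [1, 2, 3, 4]`, `ys = ['a', 'a', 'b', 'b']`, this sketch places the cut at 2.5. Relabelling only the second case as `'b'` moves the cut to 1.5, a small-scale picture of how a minor change in the data can alter the split that is chosen.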

Notes

  1. Here, we consider only trees built from binary splits. For a general overview of the state of the art in decision-tree methodologies and problems, the reader can consult Murthy (1998).

  2. Misclassification error for a nominal response, or quadratic prediction error for a continuous response.

  3. The runs were performed on a Pentium III-600 PC with 511 Mb RAM.

  4. CART is a trademark of Salford Systems for its decision tree software.

References

  • Aluja T., Nafria E. (1996), Automatic segmentation by decision trees, Proceedings in Computational Statistics COMPSTAT 1996, ed. A. Prat, Physica Verlag.

  • Aluja T., Nafria E. (1998a), Robust impurity measures in decision trees, Data Science, Classification and Related Methods, eds. C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H-H. Bock and Y. Baba, Springer.

  • Aluja T., Nafria E. (1998b), Generalised impurity measures and data diagnostics in decision trees, Visualising Categorical Data, eds. J. Blasius and M. Greenacre, Academic Press.

  • Arminger G., Enache D., Bonne T. (1997), Analyzing credit risk data: a comparison of logistic discrimination, classification tree analysis and feedforward networks, Computational Statistics, 12(2), 293–310.

  • Bravo M.C., Garcia-Santesmases J.M. (2000), Symbolic object description of strata by segmentation trees, Computational Statistics 15(1), 13–24.

  • Breiman L., Friedman J.H., Olshen R.A., Stone C.J. (1984), Classification and Regression Trees, Wadsworth International Group, Belmont, California.

  • Breiman L. (1996a), Technical note: some properties of splitting criteria, Machine Learning, 24, 41–47.

  • Breiman L. (1996b), Bagging predictors, Machine Learning, 24, 123–140.

  • Celeux G., Lechevallier Y. (1982), Méthodes de Segmentation non Paramétriques, Revue de Statistique Appliquée, 30(4), 39–53.

  • Ciampi A. (1991), Generalized Regression Trees, Computational Statistics and Data Analysis, 12, 57–78.

  • Conversano C., Mola F., Siciliano R. (2001), Partitioning Algorithms and Combined Model Integration for Data Mining, Computational Statistics, 16(3), 323–339.

  • Greenacre M. (1984), Theory and Application of Correspondence Analysis, Academic Press.

  • Gueguen A., Nakache J.P. (1988), Méthode de discrimination basée sur la construction d’un arbre de décision binaire, Revue de Statistique Appliquée, 36(1), 19–38.

  • Hand D.J. (1997), Construction and Assessment of Classification Rules, J. Wiley.

  • Hofmann H., Unwin A., Wilhelm A. (2001), Data Mining and Statistics. Introduction, Computational Statistics 16(3), 317–321.

  • Kass G.V. (1980), An Exploratory Technique for Investigating Large Quantities of Categorical Data, Applied Statistics, 29(2), 119–127.

  • Mola F., Siciliano R. (1992), A two-stage predictive splitting algorithm in binary segmentation, Computational Statistics, vol. 1, eds. Y. Dodge and J. Whittaker, Physica Verlag.

  • Murthy S.K. (1998), Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey, Data Mining and Knowledge Discovery, 2, 345–389.

  • Pfeiffer K.P., Pesec B., Mischak R. (1994), Stability of Regression Trees, Proceedings in Computational Statistics COMPSTAT 1994, eds. R. Dutter and W. Grossmann, Physica Verlag.

  • Quinlan J.R. (1996), Bagging, boosting and C4.5, Proceedings of the Thirteenth National Conference on Artificial Intelligence, 725–730, eds. W. J. Clancey and D. Weld, AAAI Press.

  • Sonquist J.A., Morgan J.N. (1964), The Detection of Interaction Effects, Ann Arbor: Institute for Social Research, University of Michigan.

  • Vach W. (1995), Classification trees, Computational Statistics, 10(1), 9–14.

Cite this article

Aluja-Banet, T., Nafria, E. Stability and scalability in decision trees. Computational Statistics 18, 505–520 (2003). https://doi.org/10.1007/BF03354613
