Stability and scalability in decision trees

Aluja-Banet, Tomàs; Nafria, Eduard

doi:10.1007/BF03354613


Published: 26 February 2015

Volume 18, pages 505–520 (2003)

Computational Statistics


Summary

Tree-based methods are statistical procedures for automatic learning from data, and their main applications are embedded in data-mining environments for decision support systems. Here we focus on two problems of decision trees: the stability of the rules obtained and their applicability to huge data sets. The tree-growing process is highly dependent on the data, in that small fluctuations in the data can cause big changes in the resulting tree, so we focus on the stability of the trees themselves. To this end, we propose a series of data diagnostics to prevent internal instability in the tree-growing process before a particular split is made. Moreover, to be effective in real managerial problems, trees must be applicable to massive amounts of stored data with maximum efficiency. For this reason we study the theoretical complexity of the tree-growing algorithm. Finally, we present an algorithm that copes with both problems: its cost is linear in the number of individuals, and it can use a robust impurity measure as the splitting criterion.
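The two issues raised in the summary can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a numeric predictor, a nominal response and the ordinary Gini index in place of the robust impurity measure, and the names `gini`, `best_threshold` and `root_split_frequency` are hypothetical. Once a predictor is sorted, all candidate binary splits are scored in one linear pass over the individuals; repeating the root split on bootstrap resamples then gives a rough idea of how stable the choice of splitting variable is.

```python
import random
from collections import Counter

def gini(counts, n):
    """Gini impurity of a node holding n cases with the given class counts."""
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_threshold(xs, ys):
    """Best binary split 'x <= t' by decrease in Gini impurity.

    After sorting, every candidate threshold is scored in a single linear
    pass by moving cases one at a time from the right child to the left.
    Returns (impurity_decrease, threshold); threshold is None if no split exists.
    """
    pairs = sorted(zip(xs, ys), key=lambda p: p[0])
    n = len(pairs)
    left, right = Counter(), Counter(y for _, y in pairs)
    parent = gini(right, n)
    best = (0.0, None)
    for i in range(n - 1):
        x, y = pairs[i]
        left[y] += 1
        right[y] -= 1
        if x == pairs[i + 1][0]:
            continue  # cannot split between tied values
        n_left = i + 1
        n_right = n - n_left
        child = (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n
        gain = parent - child
        if gain > best[0]:
            best = (gain, (x + pairs[i + 1][0]) / 2)
    return best

def root_split_frequency(rows, labels, n_boot=200, seed=0):
    """Count which predictor wins the root split across bootstrap resamples."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    wins = Counter()
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        gains = [best_threshold([rows[i][j] for i in idx], ys)[0] for j in range(p)]
        wins[max(range(p), key=gains.__getitem__)] += 1
    return wins
```

For example, `best_threshold([1, 2, 3, 4], ['a', 'a', 'b', 'b'])` returns `(0.5, 2.5)`. In this sketch the sort costs O(n log n) per predictor and only the scan itself is linear. A predictor that wins the root split in nearly all resamples suggests a stable split, whereas wins spread over several predictors point to the kind of instability the summary describes.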


Notes

¹ Here we consider only trees made of binary splits. For a general overview of the state of the art in decision-tree methodologies and their problems, see Murthy (1998).
² Misclassification error for a nominal response, or quadratic prediction error for a continuous response.
³ The runs were performed on a Pentium III-600 PC with 511 MB of RAM.
⁴ CART is a trademark of Salford Systems for decision-tree software.

References

Aluja T., Nafria E. (1996), Automatic segmentation by decision trees, Proceedings on Computational Statistics COMPSTAT 1996, ed. A. Prat, Physica Verlag.
Aluja T., Nafria E. (1998a), Robust impurity measures in decision trees, Data Science, Classification and Related Methods, eds. C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H.-H. Bock and Y. Baba, Springer.
Aluja T., Nafria E. (1998b), Generalised impurity measures and data diagnostics in decision trees, Visualising Categorical Data, eds. J. Blasius and M. Greenacre, Academic Press.
Arminger G., Enache D., Bonne T. (1997), Analyzing credit risk data: a comparison of logistic discrimination, classification tree analysis and feedforward networks, Computational Statistics, 12(2), 293–310.
Bravo M.C., Garcia-Santesmases J.M. (2000), Symbolic object description of strata by segmentation trees, Computational Statistics 15(1), 13–24.
Breiman L., Friedman J.H., Olshen R.A., Stone C.J. (1984), Classification and Regression Trees, Wadsworth International Group, Belmont, California.
Breiman L. (1996a), Technical note: some properties of splitting criteria, Machine Learning, 24, 41–47.
Breiman L. (1996b), Bagging predictors, Machine Learning, 24, 123–140.
Celeux G., Lechevallier Y. (1982), Méthodes de segmentation non paramétriques, Revue de Statistique Appliquée, 30(4), 39–53.
Ciampi A. (1991), Generalized Regression Trees, Computational Statistics and Data Analysis, 12, 57–78.
Conversano C., Mola F., Siciliano R. (2001), Partitioning Algorithms and Combined Model Integration for Data Mining, Computational Statistics, 16(3), 323–339.
Greenacre M. (1984), Theory and Applications of Correspondence Analysis, Academic Press.
Gueguen A., Nakache J.P. (1988), Méthode de discrimination basée sur la construction d’un arbre de décision binaire, Revue de Statistique Appliquée, 36(1), 19–38.
Hand D.J. (1997), Construction and Assessment of Classification Rules, J. Wiley.
Hofmann H., Unwin A., Wilhelm A. (2001), Data Mining and Statistics. Introduction, Computational Statistics 16(3), 317–321.
Kass G.V. (1980), An Exploratory Technique for Investigating Large Quantities of Categorical Data, Applied Statistics, 29(2), 119–127.
Mola F., Siciliano R. (1992), A two-stage predictive splitting algorithm in binary segmentation, Computational Statistics, vol. 1, eds. Y. Dodge and J. Whittaker, Physica Verlag.
Murthy S.K. (1998), Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey, Data Mining and Knowledge Discovery, 2, 345–389.
Pfeiffer K.P., Pesec B., Mischak R. (1994), Stability of Regression Trees, Proceedings in Computational Statistics COMPSTAT 1994, eds. R. Dutter and W. Grossmann, Physica Verlag.
Quinlan J.R. (1996), Bagging, boosting and C4.5, Proceedings Thirteenth National Conference on Artificial Intelligence, 725–730, eds. W. J. Clancey and D. Weld, AAAI Press.
Sonquist J.A., Morgan J.N. (1964), The Detection of Interaction Effects, Ann Arbor: Institute for Social Research, University of Michigan.
Vach W. (1995), Classification trees, Computational Statistics, 10(1), 9–14.


Author information

Authors and Affiliations

Dept. Statistics and Operations Research, Technical University of Catalonia, Jordi Girona 1-3, 08034, Barcelona, Spain
Tomàs Aluja-Banet
Taylor Nelson Sofres Audiencia de Medios, 08190, Barcelona
Eduard Nafria



About this article

Cite this article

Aluja-Banet, T., Nafria, E. Stability and scalability in decision trees. Computational Statistics 18, 505–520 (2003). https://doi.org/10.1007/BF03354613


Published: 26 February 2015
Issue Date: September 2003
DOI: https://doi.org/10.1007/BF03354613

