Abstract
The efficiency of otherwise expedient decision tree learning can be impaired on data-mining-sized data if attribute selection requires superlinear-time processing. Optimal multisplitting of numerical attributes is an example of such a technique; even a single troublesome attribute in the domain hits its efficiency hard.
Analysis shows a direct connection between an attribute's ratio of boundary points to training examples and its maximum goodness score. Class distribution information gathered in preprocessing can be applied to obtain tighter bounds on an attribute's relevance for class prediction. These analytical bounds, however, are too loose for practical purposes.
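To make the quantity concrete, the following sketch counts the boundary points of a numerical attribute in the sense of Fayyad and Irani: a cut between two adjacent distinct attribute values can host an optimal split only if the examples on the two sides do not all share one and the same class. The function name and interface are illustrative, not taken from the paper.

```python
from collections import defaultdict

def boundary_points(values, labels):
    """Count the boundary points of a numerical attribute.

    A cut between two adjacent distinct attribute values is a boundary
    point unless the examples on both sides of the cut all belong to
    one and the same class (Fayyad & Irani's characterization).
    """
    # Group the class labels observed at each distinct attribute value.
    groups = defaultdict(set)
    for v, c in zip(values, labels):
        groups[v].add(c)
    ordered = [groups[v] for v in sorted(groups)]
    count = 0
    for left, right in zip(ordered, ordered[1:]):
        # A cut between two pure, identically labeled blocks is skipped.
        if not (len(left) == 1 and left == right):
            count += 1
    return count
```

The ratio discussed in the abstract is then simply `boundary_points(values, labels) / len(values)`.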
We experiment with heuristic methods that postpone the evaluation of attributes with a high number of boundary points. The results show that substantial time savings can be obtained on the most critical data sets without sacrificing the accuracy of the resulting classifier.
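One simple way to realize such postponement can be sketched as follows: attributes whose boundary-point count exceeds a fixed fraction of the training set are set aside, and the expensive evaluation (e.g. optimal multisplitting) is run only on the cheaper attributes unless none remain. The threshold, function names, and fallback rule here are hypothetical illustrations, not the exact heuristics studied in the paper.

```python
def select_attribute(attributes, n_examples, eval_fn, bp_fn, bp_ratio_limit=0.3):
    """Greedy attribute selection that postpones expensive attributes.

    eval_fn(a) returns the goodness score of attribute a (costly),
    bp_fn(a) returns its boundary-point count (cheap, from preprocessing).
    Attributes with more than bp_ratio_limit * n_examples boundary
    points are evaluated only if every attribute is that expensive.
    """
    cheap = [a for a in attributes
             if bp_fn(a) <= bp_ratio_limit * n_examples]
    costly = [a for a in attributes if a not in cheap]
    # Fall back to the costly attributes only when no cheap one exists.
    candidates = cheap or costly
    # The expensive evaluation runs only over the retained candidates.
    return max(candidates, key=eval_fn)
```

With precomputed boundary-point counts this avoids running the superlinear-time evaluation on the troublesome attributes in the common case.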
Joint Research Centre, European Commission
VTT Biotechnology and Food Research
© 1998 Springer-Verlag Berlin Heidelberg
Cite this paper
Elomaa, T., Rousu, J. (1998). Postponing the evaluation of attributes with a high number of boundary points. In: Żytkow, J.M., Quafafou, M. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 1998. Lecture Notes in Computer Science, vol 1510. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0094823
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65068-3
Online ISBN: 978-3-540-49687-8