Abstract
The efficiency of otherwise expedient decision tree learning can be impaired on data-mining-sized data if attribute selection requires superlinear-time processing. Optimal multisplitting of numerical attributes is an example of such a technique; even a single troublesome attribute in the domain hits its efficiency hard.
Analysis shows a direct connection between an attribute's ratio of boundary points to training examples and its maximum goodness score. Class distribution information gathered in preprocessing can be applied to obtain tighter bounds on an attribute's relevance for class prediction. These analytical bounds, however, are too loose for practical purposes.
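To make the quantity concrete, the following sketch counts the boundary points of a numerical attribute in the sense of Fayyad and Irani: a cut between two adjacent distinct attribute values can host an optimal split only if the examples on the two sides do not all share one and the same class. The function name and interface are illustrative, not taken from the paper.

```python
from collections import defaultdict

def boundary_points(values, labels):
    """Count the boundary points of a numerical attribute.

    A cut between two adjacent distinct attribute values is a boundary
    point unless the examples on both sides of the cut all belong to
    one and the same class (Fayyad & Irani's characterization).
    """
    # Group the class labels observed at each distinct attribute value.
    groups = defaultdict(set)
    for v, c in zip(values, labels):
        groups[v].add(c)
    ordered = [groups[v] for v in sorted(groups)]
    count = 0
    for left, right in zip(ordered, ordered[1:]):
        # A cut between two pure, identically labeled blocks is skipped.
        if not (len(left) == 1 and left == right):
            count += 1
    return count
```

The ratio discussed in the abstract is then simply `boundary_points(values, labels) / len(values)`.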
We experiment with heuristic methods that postpone the evaluation of attributes with a high number of boundary points. The results show that substantial time savings can be obtained on the most critical data sets without sacrificing the accuracy of the resulting classifier.
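One simple way to realize such postponement can be sketched as follows: attributes whose boundary-point count exceeds a fixed fraction of the training set are set aside, and the expensive evaluation (e.g. optimal multisplitting) is run only on the cheaper attributes unless none remain. The threshold, function names, and fallback rule here are hypothetical illustrations, not the exact heuristics studied in the paper.

```python
def select_attribute(attributes, n_examples, eval_fn, bp_fn, bp_ratio_limit=0.3):
    """Greedy attribute selection that postpones expensive attributes.

    eval_fn(a) returns the goodness score of attribute a (costly),
    bp_fn(a) returns its boundary-point count (cheap, from preprocessing).
    Attributes with more than bp_ratio_limit * n_examples boundary
    points are evaluated only if every attribute is that expensive.
    """
    cheap = [a for a in attributes
             if bp_fn(a) <= bp_ratio_limit * n_examples]
    costly = [a for a in attributes if a not in cheap]
    # Fall back to the costly attributes only when no cheap one exists.
    candidates = cheap or costly
    # The expensive evaluation runs only over the retained candidates.
    return max(candidates, key=eval_fn)
```

With precomputed boundary-point counts this avoids running the superlinear-time evaluation on the troublesome attributes in the common case.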
Joint Research Centre, European Commission
VTT Biotechnology and Food Research
© 1998 Springer-Verlag Berlin Heidelberg
Cite this paper
Elomaa, T., Rousu, J. (1998). Postponing the evaluation of attributes with a high number of boundary points. In: Żytkow, J.M., Quafafou, M. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 1998. Lecture Notes in Computer Science, vol 1510. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0094823
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65068-3
Online ISBN: 978-3-540-49687-8