Abstract
Decision tree induction algorithms scale well to large datasets thanks to their univariate, divide-and-conquer approach. However, they may fail to discover effective knowledge when the input dataset consists of a large number of uncorrelated many-valued attributes. In this paper we present an algorithm, Noah, that tackles this problem by performing a multivariate search. A multivariate search consumes far more computation time and memory, which may be prohibitive for large datasets. We remedy this by exploiting effective pruning strategies and efficient data structures. We applied our algorithm to a real cross-selling marketing application. Experimental results revealed that the application database was too complex for C4.5, which failed to discover any useful knowledge, and too large for various well-known rule discovery algorithms, which were unable to complete their task. The pruning techniques used in Noah are general in nature and can be used in other mining systems.
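The abstract contrasts univariate splitting with multivariate search over conjunctions of attribute-value conditions, made tractable by pruning. As a rough illustration of the general idea (not of Noah itself, whose data structures and pruning strategies are described in the paper), the sketch below performs a levelwise search in the style of association rule discovery (cf. the Agrawal et al. and Mannila & Toivonen references): conjunctions below a support threshold are pruned, and by the anti-monotonicity of support none of their supersets need be examined. All function names and parameters here are illustrative assumptions.

```python
def support(rows, conj):
    """Count rows satisfying every (attribute, value) condition in conj."""
    return sum(all(r.get(a) == v for a, v in conj) for r in rows)

def levelwise_rules(rows, target, min_support=2, max_len=2):
    """Mine classification rules whose bodies are conjunctions of
    attribute-value pairs, via levelwise search with support pruning.

    A conjunction is extended to the next level only if it meets the
    support threshold: no superset can have higher support, so the
    rest of the search space below it is safely pruned.
    """
    # All single attribute-value conditions, excluding the class attribute.
    items = sorted({(a, v) for r in rows for a, v in r.items() if a != target})
    level = [(it,) for it in items if support(rows, (it,)) >= min_support]
    rules = []
    for size in range(1, max_len + 1):
        for conj in level:
            labels = {r[target] for r in rows
                      if all(r.get(a) == v for a, v in conj)}
            if len(labels) == 1:  # conjunction predicts a single class
                rules.append((conj, labels.pop()))
        if size == max_len:
            break
        # Candidate generation: extend each surviving conjunction by one
        # condition on a new attribute, pruning candidates by support.
        level = sorted({tuple(sorted(conj + (it,)))
                        for conj in level for it in items
                        if it[0] not in {a for a, _ in conj}
                        and support(rows, conj + (it,)) >= min_support})
    return rules

rows = [
    {"age": "young", "income": "low", "buy": "no"},
    {"age": "young", "income": "high", "buy": "yes"},
    {"age": "old", "income": "low", "buy": "yes"},
    {"age": "old", "income": "high", "buy": "yes"},
]
rules = levelwise_rules(rows, target="buy")
```

On this toy dataset the search keeps only conjunctions covering at least two rows and emits rules such as `age=old → buy=yes`; with many uncorrelated many-valued attributes, the support-based pruning is what keeps the exponential space of conjunctions manageable.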
References
R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining. AAAI Press / The MIT Press, 1996.
R.J. Bayardo. Brute-force mining of high-confidence classification rules. In D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, editors, Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97). AAAI Press, 1997.
R.J. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining in large, dense databases. In Proc. of the 15th Int’l Conf. on Data Engineering, pages 188–197, 1999.
P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3:261–283, 1989.
W.W. Cohen. Learning trees and rules with set-valued features. In Proceedings of the Thirteenth National Conference on Artificial Intelligence AAAI-96. AAAI press/ The MIT press, August 1996.
L. G. Cooper and G. Giuffrida. Turning datamining into a management science tool: New algorithms and empirical results. Management Science, 2000 (To appear).
P. Domingos. Linear-time rule induction. In E. Simoudis, J. W. Han, and U. Fayyad, editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), page 96. AAAI Press, 1996.
G. Giuffrida, L. G. Cooper, and W. W. Chu. A scalable bottom-up data mining algorithm for relational databases. In 10th International Conference on Scientific and Statistical Database Management (SSDBM’ 98), Capri, Italy, July 1998. IEEE Publisher.
R.C. Holte, L.E. Acker, and B.W. Porter. Concept learning and the problem of small disjuncts. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, Detroit, MI, 1989. Morgan Kaufmann.
B. Lent, A. Swami, and J. Widom. Clustering association rules. In Proceedings of the Thirteenth International Conference on Data Engineering (ICDE’ 97), Birmingham, UK, 1997.
B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In R. Agrawal, P. Stolorz, and G. Piatetsky-Shapiro, editors, Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), page 80. AAAI Press, 1998.
H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1, November 1997.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. Lecture Notes in Computer Science, 1057, 1996.
Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, California, 1988.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California, 1993.
J. C. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In T. M. Vijayaraman, Alejandro P. Buchmann, C. Mohan, and Nandlal L. Sarda, editors, VLDB 1996, Mumbai (Bombay), India, September 1996. Morgan Kaufmann.
M. Wang, B. Iyer, and J. S. Vitter. Scalable mining for classification rules in relational databases. In Proceedings of International Database Engineering and Application Symposium (IDEAS’98), Cardiff, Wales, U.K., July 1998.
© 2000 Springer-Verlag Berlin Heidelberg
Cite this paper
Giuffrida, G., Chu, W.W., Hanssens, D.M. (2000). Mining Classification Rules from Datasets with Large Number of Many-Valued Attributes. In: Zaniolo, C., Lockemann, P.C., Scholl, M.H., Grust, T. (eds) Advances in Database Technology — EDBT 2000. EDBT 2000. Lecture Notes in Computer Science, vol 1777. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46439-5_23
Print ISBN: 978-3-540-67227-2
Online ISBN: 978-3-540-46439-6