Abstract
In a world where massive amounts of data are recorded, we need data mining technologies that can extract knowledge from the data in a reasonable time. The Top Down Induction of Decision Trees (TDIDT) algorithm is a very widely used technology for predicting the classification of newly recorded data. However, alternative technologies have been derived that often produce better rules but do not scale well on large datasets. One such alternative to TDIDT is the PrismTCS algorithm, which performs particularly well on noisy data. In this paper we introduce Prism and investigate its scaling behaviour. We describe how we improved the scalability of the serial version of Prism and examine its limitations. We then describe our work to overcome these limitations by developing a framework to parallelise algorithms of the Prism family and similar algorithms. We also present the scale-up results of a first prototype implementation.
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Stahl, F., Bramer, M., Adda, M. (2009). PMCRI: A Parallel Modular Classification Rule Induction Framework. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2009. Lecture Notes in Computer Science, vol 5632. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03070-3_12
DOI: https://doi.org/10.1007/978-3-642-03070-3_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03069-7
Online ISBN: 978-3-642-03070-3
eBook Packages: Computer Science, Computer Science (R0)