Data Cloud for Distributed Data Mining via Pipelined MapReduce

Wu, Zhiang; Cao, Jie; Fang, Changjian

doi:10.1007/978-3-642-27609-5_20

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7103)

Included in the following conference series:

International Workshop on Agents and Data Mining Interaction

Abstract

Distributed data mining (DDM), which often utilizes autonomous agents, is a process for extracting globally interesting associations, classifiers, clusters, and other patterns from distributed data. As datasets double in size every year, repeatedly moving the data to distant CPUs incurs a high communication cost. In this paper, a data cloud is used to implement DDM so that computation is moved to the data rather than the data to the computation. MapReduce is a popular programming model for implementing data-centric distributed computing. First, a cloud system architecture for DDM is proposed. Second, a modified MapReduce framework called pipelined MapReduce is presented. We select Apriori as a case study and discuss its implementation within the MapReduce framework. Finally, several experiments are conducted. The experimental results show that, with a moderate number of map tasks, the execution time of DDM algorithms (here, Apriori) can be reduced remarkably. A performance comparison between traditional MapReduce and our pipelined MapReduce shows that the map and reduce tasks in pipelined MapReduce can run in parallel, and that pipelined MapReduce greatly decreases the execution time of the DDM algorithm. The data cloud is suitable for a multitude of DDM algorithms and can provide significant speedups.
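To make the Apriori case study concrete, the following is a minimal sketch of how each Apriori pass can be phrased as a map step (count candidate itemsets within one data split) and a reduce step (sum the local counts and apply the support threshold). It is not the authors' implementation, which is not shown in this preview: the function names `map_count`, `reduce_sum`, `next_candidates`, and `apriori`, the toy transactions, and the in-process driver loop are all assumptions made for illustration. Pipelining between map and reduce tasks is not modelled here; the phases run one after the other.

```python
from collections import Counter
from itertools import combinations


def map_count(split, candidates):
    """Map task (hypothetical): count each candidate itemset within one data split."""
    local = Counter()
    for transaction in split:
        items = frozenset(transaction)
        for cand in candidates:
            if cand <= items:  # candidate is contained in this transaction
                local[cand] += 1
    return list(local.items())


def reduce_sum(map_outputs, min_support):
    """Reduce task (hypothetical): sum local counts, keep itemsets meeting min_support."""
    total = Counter()
    for pairs in map_outputs:
        for itemset, count in pairs:
            total[itemset] += count
    return {s: c for s, c in total.items() if c >= min_support}


def next_candidates(frequent, k):
    """Join frequent (k-1)-itemsets and prune any k-itemset with an infrequent subset."""
    out = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        if len(union) == k and all(
            frozenset(sub) in frequent for sub in combinations(union, k - 1)
        ):
            out.add(union)
    return out


def apriori(splits, min_support):
    """Driver: one map/reduce round per itemset size until no candidates remain."""
    items = {item for split in splits for t in split for item in t}
    candidates = {frozenset([i]) for i in items}
    result, k = {}, 1
    while candidates:
        frequent = reduce_sum((map_count(s, candidates) for s in splits), min_support)
        result.update(frequent)
        k += 1
        candidates = next_candidates(set(frequent), k)
    return result


# Toy data (assumed): two splits standing in for data held on two nodes.
splits = [
    [["bread", "milk"], ["bread", "beer", "milk"]],
    [["milk", "beer"], ["bread", "milk", "beer"]],
]
frequent = apriori(splits, min_support=3)
```

In this sketch only the small (itemset, count) pairs leave each split, which is the sense in which computation travels to the data rather than the reverse.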

Author information

Authors and Affiliations

Jiangsu Provincial Key Laboratory of E-Business, Nanjing University of Finance and Economics, Nanjing, P.R. China
Zhiang Wu, Jie Cao & Changjian Fang

Editor information

Editors and Affiliations

Faculty of Engineering and Information Technology, University of Technology Sydney, Broadway, PO Box 123, 2007, Sydney, NSW, Australia
Longbing Cao
Instituto de Informática, Universidade Federal do Rio Grande do Sul (UFRGS), Caixa Postal 15064, 91.501-970, Porto Alegre, RS, Brazil
Ana L. C. Bazzan
Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece
Andreas L. Symeonidis
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, 39, 14th Liniya, 199178, St. Petersburg, Russia
Vladimir I. Gorodetsky
Department of Knowledge Engineering, Maastricht University, P.O. Box 616, 6200, Maastricht, MD, The Netherlands
Gerhard Weiss
Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Room 1138 SEO, 60607, Chicago, IL, USA
Philip S. Yu

About this paper

Cite this paper

Wu, Z., Cao, J., Fang, C. (2012). Data Cloud for Distributed Data Mining via Pipelined MapReduce. In: Cao, L., Bazzan, A.L.C., Symeonidis, A.L., Gorodetsky, V.I., Weiss, G., Yu, P.S. (eds) Agents and Data Mining Interaction. ADMI 2011. Lecture Notes in Computer Science, vol 7103. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27609-5_20

DOI: https://doi.org/10.1007/978-3-642-27609-5_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27608-8
Online ISBN: 978-3-642-27609-5
eBook Packages: Computer Science, Computer Science (R0)
