Data Cloud for Distributed Data Mining via Pipelined MapReduce

  • Conference paper
Agents and Data Mining Interaction (ADMI 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7103)

Abstract

Distributed data mining (DDM), which often utilizes autonomous agents, is a process for extracting globally interesting associations, classifiers, clusters, and other patterns from distributed data. As datasets double in size every year, repeatedly moving the data to distant CPUs incurs a high communication cost. In this paper, a data cloud is used to implement DDM so that the computation is moved to the data rather than the data to the computation. MapReduce is a popular programming model for implementing data-centric distributed computing. First, a cloud system architecture for DDM is proposed. Second, a modified MapReduce framework called pipelined MapReduce is presented. We select Apriori as a case study and discuss its implementation within the MapReduce framework. Finally, several experiments are conducted. The experimental results show that, with a moderate number of map tasks, the execution time of a DDM algorithm (Apriori in our case study) can be reduced substantially. A performance comparison between traditional MapReduce and our pipelined MapReduce shows that the map and reduce tasks in pipelined MapReduce can run in parallel, and that pipelined MapReduce greatly decreases the execution time of the DDM algorithm. The data cloud is suitable for a multitude of DDM algorithms and can provide significant speedups.
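
To make the Apriori case study concrete, the following is a minimal sketch of a single support-counting pass written in map/reduce style. It is not the authors' implementation and does not use Hadoop: the function names (`map_count`, `reduce_count`, `frequent_itemsets`), the in-memory `splits` structure, and the sample transactions are all assumptions made for illustration. It also omits Apriori's candidate generation and pruning between passes, and it runs the map and reduce steps sequentially rather than in a pipelined manner.

```python
from collections import defaultdict
from itertools import combinations


def map_count(transactions, k):
    """Map step (hypothetical): emit (itemset, 1) for each k-item subset."""
    for transaction in transactions:
        # Sort and deduplicate so the same itemset always yields the same key.
        for itemset in combinations(sorted(set(transaction)), k):
            yield itemset, 1


def reduce_count(pairs, min_support):
    """Reduce step (hypothetical): sum counts and keep frequent itemsets."""
    totals = defaultdict(int)
    for itemset, count in pairs:
        totals[itemset] += count
    return {itemset: n for itemset, n in totals.items() if n >= min_support}


def frequent_itemsets(splits, k, min_support):
    """Run one counting pass; each split stands in for one map task's data."""
    pairs = (pair for split in splits for pair in map_count(split, k))
    return reduce_count(pairs, min_support)


# Illustrative data only: two splits of two transactions each.
splits = [
    [["bread", "milk"], ["bread", "butter", "milk"]],
    [["butter", "milk"], ["bread", "milk"]],
]
result = frequent_itemsets(splits, k=2, min_support=2)
print(result)
```

In this sketch each split is processed where it is stored and only the small (itemset, count) pairs travel to the reduce step, which mirrors the idea of moving computation to the data.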


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wu, Z., Cao, J., Fang, C. (2012). Data Cloud for Distributed Data Mining via Pipelined MapReduce. In: Cao, L., Bazzan, A.L.C., Symeonidis, A.L., Gorodetsky, V.I., Weiss, G., Yu, P.S. (eds) Agents and Data Mining Interaction. ADMI 2011. Lecture Notes in Computer Science, vol 7103. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27609-5_20

  • DOI: https://doi.org/10.1007/978-3-642-27609-5_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-27608-8

  • Online ISBN: 978-3-642-27609-5

  • eBook Packages: Computer Science, Computer Science (R0)
