
Accelerating Iterative Big Data Computing Through MPI

Regular Paper

Journal of Computer Science and Technology

Abstract

Current popular systems, Hadoop and Spark, cannot achieve satisfactory performance on iterative big data applications because they overlap computation and communication inefficiently. The pipeline of computation, data movement, and data management plays a key role in current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running the PageRank workload, and then propose an event-driven pipeline and an in-memory shuffle design that better overlap computation and communication. We implement these designs as DataMPI-Iteration, an MPI-based library for iterative big data computing. Our performance evaluation shows DataMPI-Iteration achieves 9X∼21X speedup over Apache Hadoop and 2X∼3X speedup over Apache Spark for PageRank and K-means.
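The key idea the abstract describes is pipelining: instead of computing all partition results and only then shuffling them, each partition's output is handed to the communication layer as soon as it is ready, so data movement proceeds while the remaining partitions are still being computed. The following is a minimal conceptual sketch of that overlap for one PageRank iteration, written in Python with a background thread standing in for the MPI-based in-memory shuffle; it is an illustration of the technique, not the paper's actual C/MPI implementation, and all names in it are hypothetical.

```python
import threading
import queue

def pagerank_iteration(ranks, links, damping=0.85):
    """One PageRank step over a tiny in-memory graph.

    Each source partition's contributions are pushed to the 'shuffle'
    queue as soon as they are computed, so the consumer (a stand-in for
    non-blocking MPI receives) aggregates them concurrently with the
    remaining computation -- the event-driven pipeline idea.
    """
    shuffle_q = queue.Queue()
    contribs = {}  # destination vertex -> accumulated rank contribution

    def shuffler():
        # Drain the in-memory shuffle until the end-of-iteration sentinel.
        while True:
            item = shuffle_q.get()
            if item is None:
                break
            dst, c = item
            contribs[dst] = contribs.get(dst, 0.0) + c

    t = threading.Thread(target=shuffler)
    t.start()

    # Compute contributions vertex by vertex; emit each one immediately
    # instead of buffering the whole iteration's output.
    for src, outs in links.items():
        share = ranks[src] / len(outs)
        for dst in outs:
            shuffle_q.put((dst, share))

    shuffle_q.put(None)  # signal end of this iteration's shuffle
    t.join()

    n = len(ranks)
    return {v: (1 - damping) / n + damping * contribs.get(v, 0.0)
            for v in ranks}
```

In the real library, the queue hand-off would correspond to non-blocking MPI sends/receives between processes, so communication cost is hidden behind computation rather than serialized after it.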




Author information


Correspondence to Fan Liang.

Additional information

Special Section on Applications and Industry

This work is partially supported by the Strategic Priority Program of Chinese Academy of Sciences under Grant No. XDA06010401, the National High Technology Research and Development 863 Program of China under Grant Nos. 2013AA01A209 and 2013AA01A213, and the Guangdong Talents Program of China under Grant No. 201001D0104726115.


About this article


Cite this article

Liang, F., Lu, X. Accelerating Iterative Big Data Computing Through MPI. J. Comput. Sci. Technol. 30, 283–294 (2015). https://doi.org/10.1007/s11390-015-1522-5
