
Accelerating Iterative Big Data Computing Through MPI

Regular Paper

Journal of Computer Science and Technology

Abstract

Current popular systems, Hadoop and Spark, cannot achieve satisfactory performance on iterative big data applications because they overlap computation and communication inefficiently. The pipeline of computation, data movement, and data management plays a key role in current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running the PageRank workload, and then propose an event-driven pipeline and an in-memory shuffle design that better overlap computation and communication. We implement these designs as DataMPI-Iteration, an MPI-based library for iterative big data computing. Our performance evaluation shows DataMPI-Iteration achieves 9X∼21X speedup over Apache Hadoop and 2X∼3X speedup over Apache Spark for PageRank and K-means.
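The key idea the abstract describes is pipelining: instead of computing all partition results and only then shuffling them, each partition's output is handed to the communication layer as soon as it is ready, so data movement proceeds while the remaining partitions are still being computed. The following is a minimal conceptual sketch of that overlap for one PageRank iteration, written in Python with a background thread standing in for the MPI-based in-memory shuffle; it is an illustration of the technique, not the paper's actual C/MPI implementation, and all names in it are hypothetical.

```python
import threading
import queue

def pagerank_iteration(ranks, links, damping=0.85):
    """One PageRank step over a tiny in-memory graph.

    Each source partition's contributions are pushed to the 'shuffle'
    queue as soon as they are computed, so the consumer (a stand-in for
    non-blocking MPI receives) aggregates them concurrently with the
    remaining computation -- the event-driven pipeline idea.
    """
    shuffle_q = queue.Queue()
    contribs = {}  # destination vertex -> accumulated rank contribution

    def shuffler():
        # Drain the in-memory shuffle until the end-of-iteration sentinel.
        while True:
            item = shuffle_q.get()
            if item is None:
                break
            dst, c = item
            contribs[dst] = contribs.get(dst, 0.0) + c

    t = threading.Thread(target=shuffler)
    t.start()

    # Compute contributions vertex by vertex; emit each one immediately
    # instead of buffering the whole iteration's output.
    for src, outs in links.items():
        share = ranks[src] / len(outs)
        for dst in outs:
            shuffle_q.put((dst, share))

    shuffle_q.put(None)  # signal end of this iteration's shuffle
    t.join()

    n = len(ranks)
    return {v: (1 - damping) / n + damping * contribs.get(v, 0.0)
            for v in ranks}
```

In the real library, the queue hand-off would correspond to non-blocking MPI sends/receives between processes, so communication cost is hidden behind computation rather than serialized after it.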




Author information


Correspondence to Fan Liang.

Additional information

Special Section on Applications and Industry

This work is partially supported by the Strategic Priority Program of Chinese Academy of Sciences under Grant No. XDA06010401, the National High Technology Research and Development 863 Program of China under Grant Nos. 2013AA01A209 and 2013AA01A213, and the Guangdong Talents Program of China under Grant No. 201001D0104726115.


About this article


Cite this article

Liang, F., Lu, X. Accelerating Iterative Big Data Computing Through MPI. J. Comput. Sci. Technol. 30, 283–294 (2015). https://doi.org/10.1007/s11390-015-1522-5
