An Executable Sequential Specification for Spark Aggregation

Chen, Yu-Fang; Hong, Chih-Duo; Lengál, Ondřej; Mu, Shin-Cheng; Sinha, Nishant; Wang, Bow-Yaw

doi:10.1007/978-3-319-59647-1_31

Yu-Fang Chen¹⁵,
Chih-Duo Hong¹⁵,
Ondřej Lengál^15,16,
Shin-Cheng Mu¹⁵,
Nishant Sinha¹⁷ &
…
Bow-Yaw Wang¹⁵

Part of the book series: Lecture Notes in Computer Science ((LNCCN,volume 10299))

Included in the following conference series:

International Conference on Networked Systems

813 Accesses
8 Citations

Abstract

Spark is a new promising platform for scalable data-parallel computation. It provides several high-level application programming interfaces (APIs) to perform parallel data aggregation. Since execution of parallel aggregation in Spark is inherently non-deterministic, a natural requirement for Spark programs is to give the same result for any execution on the same data set. We present PureSpark, an executable formal Haskell specification for Spark aggregate combinators. Our specification allows us to deduce the precise condition for deterministic outcomes from Spark aggregation. We report case studies analyzing deterministic outcomes and correctness of Spark programs.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Apache Spark. https://github.com/apache/spark
IBM DB2 Version 9.7. Partitioned Tables. https://ibm.biz/BdHyYR
The Scalaz project. https://github.com/scalaz
PureSpark. https://github.com/guluchen/purespark
Bennett, J., Grout, R., Pebay, P., Roe, D., Thompson, D.: Numerically stable, single-pass, parallel statistics algorithms. In: CLUSTER, pp. 1–8 (2009)
Google Scholar
Bird, R.S.: An introduction to the theory of lists. In: Broy, M. (eds) Logic of Programming and Calculi of Discrete Design. NATO ASI Series (Series F: Computer and Systems Sciences), vol. 36, pp. 5–42. Springer, Heidelberg (1987)
Google Scholar
Bocchino Jr., R.L., Adve, V.S., Dig, D., Adve, S.V., Heumann, S., Komuravelli, R., Overbey, J., Simmons, P., Sung, H., Vakilian, M.: A type and effect system for deterministic parallel Java. In: OOPSLA, pp. 97–116 (2009)
Google Scholar
Bocchino Jr., R.L., Heumann, S., Honarmand, N., Adve, S.V., Adve, V.S., Welc, A., Shpeisman, T.: Safe nondeterminism in a deterministic-by-default parallel language. SIGPLAN Not. 46(1), 535–548 (2011)
Google Scholar
Budimlic, Z., Burke, M.G., Cavé, V., Knobe, K., Lowney, G., Newton, R., Palsberg, J., Peixotto, D.M., Sarkar, V., Schlimbach, F., Tasirlar, S.: Concurrent collections. Sci. Program. 18(3–4), 203–217 (2010)
Google Scholar
Burnim, J., Sen, K.: Asserting and checking determinism for multithreaded programs. Commun. ACM 53(6), 97–105 (2010)
Article Google Scholar
Chaudhuri, S.: An overview of query optimization in relational systems. In: PODS 1998 (1998)
Google Scholar
Chen, Y., Hong, C., Lengál, O., Mu, S., Sinha, N., Wang, B.: An executable sequential specification for Spark aggregation arXiv:1702.02439 [cs.DC] (2017)
Chen, Y.-F., Hong, C.-D., Sinha, N., Wang, B.-Y.: Commutativity of reducers. In: Baier, C., Tinelli, C. (eds.) TACAS 2015. LNCS, vol. 9035, pp. 131–146. Springer, Heidelberg (2015). doi:10.1007/978-3-662-46681-0_9
Google Scholar
Chu, C., Kim, S.K., Lin, Y., Yu, Y., Bradski, G.R., Ng, A.Y., Olukotun, K.: Map-Reduce for machine learning on multicore. In: NIPS, pp. 281–288 (2006)
Google Scholar
Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. Commun. ACM 53(1), 72–77 (2010)
Article Google Scholar
Dörre, J., Apel, S., Lengauer, C.: Modeling and optimizing MapReduce programs. Concurrency Comput. Pract. Experience 27(7), 1734–1766 (2015)
Article Google Scholar
Emoto, K., Fischer, S., Hu, Z.: Generate, test, and aggregate. In: Seidl, H. (ed.) ESOP 2012. LNCS, vol. 7211, pp. 254–273. Springer, Heidelberg (2012). doi:10.1007/978-3-642-28869-2_13
Chapter Google Scholar
Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of MapReduce programs. Proc. VLDB Endowment 4(11), 1111–1122 (2011)
Google Scholar
Herodotou, H., Borisov, N., Babu, S.: Query optimization techniques for partitioned tables. In: SIGMOD 2011, pp. 49–60 (2011)
Google Scholar
Ioannidis, Y.E.: Query optimization. ACM Comput. Surv. 28(1), 121–123 (1996)
Article Google Scholar
Karloff, H., Suri, S., Vassilvitskii, S.: A model of computation for MapReduce. In: SODA, pp. 938–948 (2010)
Google Scholar
Leijen, D., Fähndrich, M., Burckhardt, S.: Prettier concurrency: Purely functional concurrent revisions. In: Haskell, pp. 83–94 (2011)
Google Scholar
Liu, C., Zhang, J., Zhou, H., McDirmid, S., Guo, Z., Moscibroda, T.: Automating distributed partial aggregation. In: SoCC, pp. 1:1–1:12 (2014)
Google Scholar
Radoi, C., Fink, S.J., Rabbah, R.M., Sridharan, M.: Translating imperative code to MapReduce. In: OOPSLA, pp. 909–927 (2014)
Google Scholar
Sakr, S., Liu, A., Fayoumi, A.G.: The family of MapReduce and large-scale data processing systems. ACM Comput. Surv. 46(1), 11:1–11:44 (2013)
Google Scholar
Tian, Y., Tatikonda, S., Reinwald, B.: Scalable and numerically stable descriptive statistics in SystemML. In: ICDE, pp. 1351–1359 (2012)
Google Scholar
Xiao, T., Zhang, J., Zhou, H., Guo, Z., McDirmid, S., Lin, W., Chen, W., Zhou, L.: Nondeterminism in MapReduce considered harmful? an empirical study on non-commutative aggregators in MapReduce programs. In: Companion Proceedings of ICSE, pp. 44–53 (2014)
Google Scholar
Xu, Z., Hirzel, M., Rothermel, G.: Semantic characterization of MapReduce workloads. In: IISWC, pp. 87–97 (2013)
Google Scholar
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI, pp. 15–28 (2012)
Google Scholar
Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez, J., Shenker, S., Stoica, I.: Apache spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016)
Article Google Scholar
Zhang, Z., Cherkasova, L., Verma, A., Loo, B.T.: Performance modeling and optimization of deadline-driven Pig programs. ACM Trans. Auton. Adapt. Syst. 8(3), 14:1–14:28 (2013)
Google Scholar

Download references

Acknowledgement

This work was supported by the Czech Science Foundation (project 17-12465S), the BUT FIT project FIT-S-17-4014, the IT4IXS: IT4Innovations Excellence in Science project (LQ1602), and Ministry of Science and Technology, R.O.C. (MOST projects 103-2221-E-001-019-MY3 and 103-2221-E-001-020-MY3).

Author information

Authors and Affiliations

Academia Sinica, Taipei, Taiwan
Yu-Fang Chen, Chih-Duo Hong, Ondřej Lengál, Shin-Cheng Mu & Bow-Yaw Wang
Brno University of Technology, Brno, Czech Republic
Ondřej Lengál
IBM Research, New Delhi, India
Nishant Sinha

Authors

Yu-Fang Chen
View author publications
You can also search for this author in PubMed Google Scholar
Chih-Duo Hong
View author publications
You can also search for this author in PubMed Google Scholar
Ondřej Lengál
View author publications
You can also search for this author in PubMed Google Scholar
Shin-Cheng Mu
View author publications
You can also search for this author in PubMed Google Scholar
Nishant Sinha
View author publications
You can also search for this author in PubMed Google Scholar
Bow-Yaw Wang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ondřej Lengál .

Editor information

Editors and Affiliations

Department of Computer Science, University of California, Santa Barbara, Santa Barbara, California, USA
Amr El Abbadi
Université de Lausanne, Lausanne, Switzerland
Benoît Garbinato

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Chen, YF., Hong, CD., Lengál, O., Mu, SC., Sinha, N., Wang, BY. (2017). An Executable Sequential Specification for Spark Aggregation. In: El Abbadi, A., Garbinato, B. (eds) Networked Systems. NETYS 2017. Lecture Notes in Computer Science(), vol 10299. Springer, Cham. https://doi.org/10.1007/978-3-319-59647-1_31

Download citation

DOI: https://doi.org/10.1007/978-3-319-59647-1_31
Published: 14 May 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59646-4
Online ISBN: 978-3-319-59647-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics