Skip to main content

An Executable Sequential Specification for Spark Aggregation

  • Conference paper
  • First Online:
Networked Systems (NETYS 2017)

Part of the book series: Lecture Notes in Computer Science ((LNCCN,volume 10299))

Included in the following conference series:

Abstract

Spark is a new promising platform for scalable data-parallel computation. It provides several high-level application programming interfaces (APIs) to perform parallel data aggregation. Since execution of parallel aggregation in Spark is inherently non-deterministic, a natural requirement for Spark programs is to give the same result for any execution on the same data set. We present PureSpark, an executable formal Haskell specification for Spark aggregate combinators. Our specification allows us to deduce the precise condition for deterministic outcomes from Spark aggregation. We report case studies analyzing deterministic outcomes and correctness of Spark programs.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Apache Spark. https://github.com/apache/spark

  2. IBM DB2 Version 9.7. Partitioned Tables. https://ibm.biz/BdHyYR

  3. The Scalaz project. https://github.com/scalaz

  4. PureSpark. https://github.com/guluchen/purespark

  5. Bennett, J., Grout, R., Pebay, P., Roe, D., Thompson, D.: Numerically stable, single-pass, parallel statistics algorithms. In: CLUSTER, pp. 1–8 (2009)

    Google Scholar 

  6. Bird, R.S.: An introduction to the theory of lists. In: Broy, M. (eds) Logic of Programming and Calculi of Discrete Design. NATO ASI Series (Series F: Computer and Systems Sciences), vol. 36, pp. 5–42. Springer, Heidelberg (1987)

    Google Scholar 

  7. Bocchino Jr., R.L., Adve, V.S., Dig, D., Adve, S.V., Heumann, S., Komuravelli, R., Overbey, J., Simmons, P., Sung, H., Vakilian, M.: A type and effect system for deterministic parallel Java. In: OOPSLA, pp. 97–116 (2009)

    Google Scholar 

  8. Bocchino Jr., R.L., Heumann, S., Honarmand, N., Adve, S.V., Adve, V.S., Welc, A., Shpeisman, T.: Safe nondeterminism in a deterministic-by-default parallel language. SIGPLAN Not. 46(1), 535–548 (2011)

    Google Scholar 

  9. Budimlic, Z., Burke, M.G., Cavé, V., Knobe, K., Lowney, G., Newton, R., Palsberg, J., Peixotto, D.M., Sarkar, V., Schlimbach, F., Tasirlar, S.: Concurrent collections. Sci. Program. 18(3–4), 203–217 (2010)

    Google Scholar 

  10. Burnim, J., Sen, K.: Asserting and checking determinism for multithreaded programs. Commun. ACM 53(6), 97–105 (2010)

    Article  Google Scholar 

  11. Chaudhuri, S.: An overview of query optimization in relational systems. In: PODS 1998 (1998)

    Google Scholar 

  12. Chen, Y., Hong, C., Lengál, O., Mu, S., Sinha, N., Wang, B.: An executable sequential specification for Spark aggregation arXiv:1702.02439 [cs.DC] (2017)

  13. Chen, Y.-F., Hong, C.-D., Sinha, N., Wang, B.-Y.: Commutativity of reducers. In: Baier, C., Tinelli, C. (eds.) TACAS 2015. LNCS, vol. 9035, pp. 131–146. Springer, Heidelberg (2015). doi:10.1007/978-3-662-46681-0_9

    Google Scholar 

  14. Chu, C., Kim, S.K., Lin, Y., Yu, Y., Bradski, G.R., Ng, A.Y., Olukotun, K.: Map-Reduce for machine learning on multicore. In: NIPS, pp. 281–288 (2006)

    Google Scholar 

  15. Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. Commun. ACM 53(1), 72–77 (2010)

    Article  Google Scholar 

  16. Dörre, J., Apel, S., Lengauer, C.: Modeling and optimizing MapReduce programs. Concurrency Comput. Pract. Experience 27(7), 1734–1766 (2015)

    Article  Google Scholar 

  17. Emoto, K., Fischer, S., Hu, Z.: Generate, test, and aggregate. In: Seidl, H. (ed.) ESOP 2012. LNCS, vol. 7211, pp. 254–273. Springer, Heidelberg (2012). doi:10.1007/978-3-642-28869-2_13

    Chapter  Google Scholar 

  18. Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of MapReduce programs. Proc. VLDB Endowment 4(11), 1111–1122 (2011)

    Google Scholar 

  19. Herodotou, H., Borisov, N., Babu, S.: Query optimization techniques for partitioned tables. In: SIGMOD 2011, pp. 49–60 (2011)

    Google Scholar 

  20. Ioannidis, Y.E.: Query optimization. ACM Comput. Surv. 28(1), 121–123 (1996)

    Article  Google Scholar 

  21. Karloff, H., Suri, S., Vassilvitskii, S.: A model of computation for MapReduce. In: SODA, pp. 938–948 (2010)

    Google Scholar 

  22. Leijen, D., Fähndrich, M., Burckhardt, S.: Prettier concurrency: Purely functional concurrent revisions. In: Haskell, pp. 83–94 (2011)

    Google Scholar 

  23. Liu, C., Zhang, J., Zhou, H., McDirmid, S., Guo, Z., Moscibroda, T.: Automating distributed partial aggregation. In: SoCC, pp. 1:1–1:12 (2014)

    Google Scholar 

  24. Radoi, C., Fink, S.J., Rabbah, R.M., Sridharan, M.: Translating imperative code to MapReduce. In: OOPSLA, pp. 909–927 (2014)

    Google Scholar 

  25. Sakr, S., Liu, A., Fayoumi, A.G.: The family of MapReduce and large-scale data processing systems. ACM Comput. Surv. 46(1), 11:1–11:44 (2013)

    Google Scholar 

  26. Tian, Y., Tatikonda, S., Reinwald, B.: Scalable and numerically stable descriptive statistics in SystemML. In: ICDE, pp. 1351–1359 (2012)

    Google Scholar 

  27. Xiao, T., Zhang, J., Zhou, H., Guo, Z., McDirmid, S., Lin, W., Chen, W., Zhou, L.: Nondeterminism in MapReduce considered harmful? an empirical study on non-commutative aggregators in MapReduce programs. In: Companion Proceedings of ICSE, pp. 44–53 (2014)

    Google Scholar 

  28. Xu, Z., Hirzel, M., Rothermel, G.: Semantic characterization of MapReduce workloads. In: IISWC, pp. 87–97 (2013)

    Google Scholar 

  29. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI, pp. 15–28 (2012)

    Google Scholar 

  30. Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez, J., Shenker, S., Stoica, I.: Apache spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016)

    Article  Google Scholar 

  31. Zhang, Z., Cherkasova, L., Verma, A., Loo, B.T.: Performance modeling and optimization of deadline-driven Pig programs. ACM Trans. Auton. Adapt. Syst. 8(3), 14:1–14:28 (2013)

    Google Scholar 

Download references

Acknowledgement

This work was supported by the Czech Science Foundation (project 17-12465S), the BUT FIT project FIT-S-17-4014, the IT4IXS: IT4Innovations Excellence in Science project (LQ1602), and Ministry of Science and Technology, R.O.C. (MOST projects 103-2221-E-001-019-MY3 and 103-2221-E-001-020-MY3).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ondřej Lengál .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Chen, YF., Hong, CD., Lengál, O., Mu, SC., Sinha, N., Wang, BY. (2017). An Executable Sequential Specification for Spark Aggregation. In: El Abbadi, A., Garbinato, B. (eds) Networked Systems. NETYS 2017. Lecture Notes in Computer Science(), vol 10299. Springer, Cham. https://doi.org/10.1007/978-3-319-59647-1_31

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-59647-1_31

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59646-4

  • Online ISBN: 978-3-319-59647-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics