Skip to main content

Efficient Processing of Recursive Joins on Large-Scale Datasets in Spark

  • Conference paper
  • First Online:
Advanced Computational Methods for Knowledge Engineering (ICCSAMA 2019)

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 1121))

  • 641 Accesses

Abstract

MapReduce has become the dominant programming model for analyzing and processing large-scale data. However, the model has its own limitations. It does not completely support iterative computation, caching mechanism, and operations with multiple inputs. Besides, I/O and communication costs of the model are so expensive. One of the most notably complex operations extensively and expensively used in MapReduce is recursive joins. It requires processing characteristics that are the limitations of a MapReduce environment. Therefore, this research proposes efficient solutions for processing recursive joins in Spark, a next-generation data processing engine of MapReduce. Our proposal eliminates a large amount of redundant data generated in repeated join steps and takes advantages of in-memory computing means and cache mechanism. Through experiments, the present research shows that our solutions significantly improve the execution performance of recursive joins on large-scale datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 129.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 169.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107 (2008)

    Article  Google Scholar 

  2. Apache Hive TM. https://hive.apache.org/. Accessed 14 Jun 2016

  3. Wiley, K., Connolly, A., Krughoff, S., Gardner, J., Balazinska, M., Howe, B., Kwon, Y., Bu, Y.: Astronomical Image Processing with Hadoop. ResearchGate, July 2011

    Google Scholar 

  4. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. ResearchGate, January 1998

    Google Scholar 

  5. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999)

    Article  MathSciNet  Google Scholar 

  6. Bancilhon, F., Ramakrishnan, R.: An amateurs introduction to recursive query processing strategies (1986)

    Article  Google Scholar 

  7. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review (1999)

    Article  Google Scholar 

  8. Hagan, M.T., Demuth, H.B., Beale, M.H., De Jess, O.: Neural Network Design, 2nd edn. Martin Hagan, Atlanta (2014)

    Google Scholar 

  9. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge (1994)

    Book  Google Scholar 

  10. Moore, A.W., Zuev, D.: Internet traffic classification using bayesian analysis techniques. In: Proceedings of the 2005 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, New York, NY, USA, pp. 50–60 (2005)

    Google Scholar 

  11. Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970)

    Article  Google Scholar 

  12. Codd, E.F.: Relational completeness of data base sublanguages. In: Rustin, R. (ed) Database System, pp. 65–98. Prentice Hall (1972). IBM Research report RJ 987 San Jose, California

    Google Scholar 

  13. Tan, K.-L., Lu, H.: A note on the strategy space of multiway join query optimization problem in parallel systems. SIGMOD Rec. 20(4), 81–82 (1991)

    Article  Google Scholar 

  14. Lin, X., Orlowska, M.E.: An efficient processing of a chain join with the minimum communication cost in distributed database systems. Distrib. Parallel Databases 3(1), 69–83 (1995)

    Article  Google Scholar 

  15. Ordonez, C.: Optimizing recursive queries in SQL. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, pp. 834–839 (2005)

    Google Scholar 

  16. Idreos, S., Liarou, E., Koubarakis, M.: Continuous multi-way joins over distributed hash tables. In: Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology, New York, NY, USA pp. 594–605 (2008)

    Google Scholar 

  17. Phan, T.-C., d’Orazio, L., Rigaux, P.: A theoretical and experimental comparison of filter-based equijoins in mapreduce. In: Hameurlain, A., Kung, J., Wagner, R. (eds.) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXV, pp. 33–70. Springer, Berlin Heidelberg (2016)

    Chapter  Google Scholar 

  18. Apache SparkTM - Lightning-Fast Cluster Computing. http://spark.apache.org/. Accessed 14 Jun 2016

  19. The Apache Cassandra Project. http://cassandra.apache.org/. Accessed 14 Jun 2016

  20. Apache HBase - Apache HBaseTM Home. https://hbase.apache.org/. Accessed 14 Jun 2016

  21. Amazon Simple Storage Service (S3) - Cloud Storage. https://aws.amazon.com/s3/. Accessed 14 Jun 2016

  22. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, Berkeley, CA, USA, p. 2 (2012)

    Google Scholar 

  23. Bancilhon, F.: Naive evaluation of recursively defined relations. In: Brodie, M.L., Mylopoulos, J. (eds.) On Knowledge Base Management Systems, pp. 165–178. Springer, New York (1986)

    Chapter  Google Scholar 

  24. Ullman, J.D.: Principles of Database and Knowledge-Base Systems, vol. I. Computer Science Press Inc., New York (1988)

    Google Scholar 

  25. Ioannidis, Y.E.: On the computation of the transitive closure of relational operators. In: Proceedings of the 12th International Conference on Very Large Data Bases, San Francisco, CA, USA, pp. 403–411 (1986)

    Google Scholar 

  26. Warshall, S.: A theorem on boolean matrices. J. ACM 9(1), 11–12 (1962)

    Article  MathSciNet  Google Scholar 

  27. Warren Jr., H.S.: A modification of warshall’s algorithm for the transitive closure of binary relations. Commun. ACM 18(4), 218–220 (1975)

    Article  MathSciNet  Google Scholar 

  28. Afrati, F.N., Borkar, V., Carey, M., Polyzotis, N., Ullman, J.D.: Map-reduce extensions and recursive queries. In: Proceedings of the 14th International Conference on Extending Database Technology, New York, NY, USA, pp. 1–8 (2011)

    Google Scholar 

  29. Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: The HaLoop approach to large-scale iterative data analysis. VLDB J. 21(2), 169–190 (2012)

    Article  Google Scholar 

  30. Tran, T.T.Q.: Traitement de la jointure recursive en MapReduce. Universite Blaise Pascal-Clermont-Ferrand II, Clermont-Ferrand (2014)

    Google Scholar 

  31. Shaw, M., Koutris, P., Howe, B., Suciu, D.: Optimizing large-scale semi-naïve datalog evaluation in hadoop. In: Proceedings of the Second International Conference on Datalog in Academia and Industry, Berlin, Heidelberg, pp. 165–176 (2012)

    Google Scholar 

  32. Phan, T.-C., d’Orazio, L., Rigaux, P.: Toward intersection filter-based optimization for joins in mapreduce. In: Proceedings of the 2nd International Workshop on Cloud Intelligence, New York, NY, USA, p. 2:1–2:2 (2013)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Thuong-Cang Phan .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Phan, TC., Phan, AC., Tran, TTQ., Trieu, NT. (2020). Efficient Processing of Recursive Joins on Large-Scale Datasets in Spark. In: Le Thi, H., Le, H., Pham Dinh, T., Nguyen, N. (eds) Advanced Computational Methods for Knowledge Engineering. ICCSAMA 2019. Advances in Intelligent Systems and Computing, vol 1121. Springer, Cham. https://doi.org/10.1007/978-3-030-38364-0_35

Download citation

Publish with us

Policies and ethics