Efficient Processing of Recursive Joins on Large-Scale Datasets in Spark

Phan, Thuong-Cang; Phan, Anh-Cang; Tran, Thi-To-Quyen; Trieu, Ngoan-Thanh

doi:10.1007/978-3-030-38364-0_35

Thuong-Cang Phan¹⁸,
Anh-Cang Phan¹⁹,
Thi-To-Quyen Tran²⁰ &
…
Ngoan-Thanh Trieu¹⁸

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 1121))

Included in the following conference series:

International Conference on Computer Science, Applied Mathematics and Applications

641 Accesses

Abstract

MapReduce has become the dominant programming model for analyzing and processing large-scale data. However, the model has its own limitations. It does not completely support iterative computation, caching mechanism, and operations with multiple inputs. Besides, I/O and communication costs of the model are so expensive. One of the most notably complex operations extensively and expensively used in MapReduce is recursive joins. It requires processing characteristics that are the limitations of a MapReduce environment. Therefore, this research proposes efficient solutions for processing recursive joins in Spark, a next-generation data processing engine of MapReduce. Our proposal eliminates a large amount of redundant data generated in repeated join steps and takes advantages of in-memory computing means and cache mechanism. Through experiments, the present research shows that our solutions significantly improve the execution performance of recursive joins on large-scale datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 129.00; Price excludes VAT (USA)

Softcover Book: USD 169.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107 (2008)
Article Google Scholar
Apache Hive TM. https://hive.apache.org/. Accessed 14 Jun 2016
Wiley, K., Connolly, A., Krughoff, S., Gardner, J., Balazinska, M., Howe, B., Kwon, Y., Bu, Y.: Astronomical Image Processing with Hadoop. ResearchGate, July 2011
Google Scholar
Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. ResearchGate, January 1998
Google Scholar
Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999)
Article MathSciNet Google Scholar
Bancilhon, F., Ramakrishnan, R.: An amateurs introduction to recursive query processing strategies (1986)
Article Google Scholar
Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review (1999)
Article Google Scholar
Hagan, M.T., Demuth, H.B., Beale, M.H., De Jess, O.: Neural Network Design, 2nd edn. Martin Hagan, Atlanta (2014)
Google Scholar
Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge (1994)
Book Google Scholar
Moore, A.W., Zuev, D.: Internet traffic classification using bayesian analysis techniques. In: Proceedings of the 2005 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, New York, NY, USA, pp. 50–60 (2005)
Google Scholar
Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970)
Article Google Scholar
Codd, E.F.: Relational completeness of data base sublanguages. In: Rustin, R. (ed) Database System, pp. 65–98. Prentice Hall (1972). IBM Research report RJ 987 San Jose, California
Google Scholar
Tan, K.-L., Lu, H.: A note on the strategy space of multiway join query optimization problem in parallel systems. SIGMOD Rec. 20(4), 81–82 (1991)
Article Google Scholar
Lin, X., Orlowska, M.E.: An efficient processing of a chain join with the minimum communication cost in distributed database systems. Distrib. Parallel Databases 3(1), 69–83 (1995)
Article Google Scholar
Ordonez, C.: Optimizing recursive queries in SQL. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, pp. 834–839 (2005)
Google Scholar
Idreos, S., Liarou, E., Koubarakis, M.: Continuous multi-way joins over distributed hash tables. In: Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology, New York, NY, USA pp. 594–605 (2008)
Google Scholar
Phan, T.-C., d’Orazio, L., Rigaux, P.: A theoretical and experimental comparison of filter-based equijoins in mapreduce. In: Hameurlain, A., Kung, J., Wagner, R. (eds.) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXV, pp. 33–70. Springer, Berlin Heidelberg (2016)
Chapter Google Scholar
Apache SparkTM - Lightning-Fast Cluster Computing. http://spark.apache.org/. Accessed 14 Jun 2016
The Apache Cassandra Project. http://cassandra.apache.org/. Accessed 14 Jun 2016
Apache HBase - Apache HBaseTM Home. https://hbase.apache.org/. Accessed 14 Jun 2016
Amazon Simple Storage Service (S3) - Cloud Storage. https://aws.amazon.com/s3/. Accessed 14 Jun 2016
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, Berkeley, CA, USA, p. 2 (2012)
Google Scholar
Bancilhon, F.: Naive evaluation of recursively defined relations. In: Brodie, M.L., Mylopoulos, J. (eds.) On Knowledge Base Management Systems, pp. 165–178. Springer, New York (1986)
Chapter Google Scholar
Ullman, J.D.: Principles of Database and Knowledge-Base Systems, vol. I. Computer Science Press Inc., New York (1988)
Google Scholar
Ioannidis, Y.E.: On the computation of the transitive closure of relational operators. In: Proceedings of the 12th International Conference on Very Large Data Bases, San Francisco, CA, USA, pp. 403–411 (1986)
Google Scholar
Warshall, S.: A theorem on boolean matrices. J. ACM 9(1), 11–12 (1962)
Article MathSciNet Google Scholar
Warren Jr., H.S.: A modification of warshall’s algorithm for the transitive closure of binary relations. Commun. ACM 18(4), 218–220 (1975)
Article MathSciNet Google Scholar
Afrati, F.N., Borkar, V., Carey, M., Polyzotis, N., Ullman, J.D.: Map-reduce extensions and recursive queries. In: Proceedings of the 14th International Conference on Extending Database Technology, New York, NY, USA, pp. 1–8 (2011)
Google Scholar
Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: The HaLoop approach to large-scale iterative data analysis. VLDB J. 21(2), 169–190 (2012)
Article Google Scholar
Tran, T.T.Q.: Traitement de la jointure recursive en MapReduce. Universite Blaise Pascal-Clermont-Ferrand II, Clermont-Ferrand (2014)
Google Scholar
Shaw, M., Koutris, P., Howe, B., Suciu, D.: Optimizing large-scale semi-naïve datalog evaluation in hadoop. In: Proceedings of the Second International Conference on Datalog in Academia and Industry, Berlin, Heidelberg, pp. 165–176 (2012)
Google Scholar
Phan, T.-C., d’Orazio, L., Rigaux, P.: Toward intersection filter-based optimization for joins in mapreduce. In: Proceedings of the 2nd International Workshop on Cloud Intelligence, New York, NY, USA, p. 2:1–2:2 (2013)
Google Scholar

Download references

Author information

Authors and Affiliations

Can Tho University, Can Tho, Vietnam
Thuong-Cang Phan & Ngoan-Thanh Trieu
Vinh Long University of Technology Education, Vinh Long, Vietnam
Anh-Cang Phan
University of Rennes 1, Lannion, France
Thi-To-Quyen Tran

Authors

Thuong-Cang Phan
View author publications
You can also search for this author in PubMed Google Scholar
Anh-Cang Phan
View author publications
You can also search for this author in PubMed Google Scholar
Thi-To-Quyen Tran
View author publications
You can also search for this author in PubMed Google Scholar
Ngoan-Thanh Trieu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Thuong-Cang Phan .

Editor information

Editors and Affiliations

Computer Science and Applications Department LGIPM, University of Lorraine, Metz Cedex 03, France
Hoai An Le Thi
Computer Science and Applications Department LGIPM, University of Lorraine, Metz Cedex 03, France
Hoai Minh Le
Laboratory of Mathematics, National Institute for Applied Sciences, Saint-Étienne-du-Rouvray Cedex, France
Tao Pham Dinh
Department of Information Systems, Wroclaw University of Science and Technology, Wrocław, Poland
Ngoc Thanh Nguyen

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Phan, TC., Phan, AC., Tran, TTQ., Trieu, NT. (2020). Efficient Processing of Recursive Joins on Large-Scale Datasets in Spark. In: Le Thi, H., Le, H., Pham Dinh, T., Nguyen, N. (eds) Advanced Computational Methods for Knowledge Engineering. ICCSAMA 2019. Advances in Intelligent Systems and Computing, vol 1121. Springer, Cham. https://doi.org/10.1007/978-3-030-38364-0_35

Download citation

DOI: https://doi.org/10.1007/978-3-030-38364-0_35
Published: 20 December 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38363-3
Online ISBN: 978-3-030-38364-0
eBook Packages: Intelligent Technologies and RoboticsIntelligent Technologies and Robotics (R0)

Publish with us

Policies and ethics