Abstract
MapReduce has become the dominant programming model for analyzing and processing large-scale data. However, the model has limitations: it does not fully support iterative computation, caching, or operations with multiple inputs, and its I/O and communication costs are high. The recursive join is one of the most complex operations run on MapReduce; it is widely used and expensive, and it requires exactly the processing characteristics that a MapReduce environment lacks. This research therefore proposes efficient solutions for processing recursive joins in Spark, a next-generation data processing engine that succeeds MapReduce. Our proposal eliminates a large amount of the redundant data generated in repeated join steps and takes advantage of in-memory computing and caching. Experiments show that our solutions significantly improve the execution performance of recursive joins on large-scale datasets.
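To make the idea of redundant data in repeated join steps concrete, the following is a minimal single-machine sketch of semi-naive evaluation for a transitive-closure style recursive join. It is an illustration under assumptions, not the authors' Spark implementation: the function name `transitive_closure`, the edge-list input, and the in-memory `successors` index are all hypothetical choices made here. The index stands in loosely for a dataset that a distributed engine would keep cached across iterations.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Semi-naive evaluation of:
         reach(x, z) <- edge(x, z)
         reach(x, z) <- reach(x, y), edge(y, z)
    Hypothetical helper for illustration only.
    """
    # Build the join index on the fixed relation once and reuse it in every
    # iteration (an analogy for caching the unchanged input in memory).
    successors = defaultdict(set)
    for src, dst in edges:
        successors[src].add(dst)

    closure = set(edges)  # all tuples derived so far
    delta = set(edges)    # tuples that were new in the previous iteration

    while delta:
        # Join only the newly found tuples with the edge relation, rather
        # than re-joining the whole closure and re-deriving old tuples.
        joined = {(x, z) for (x, y) in delta for z in successors.get(y, ())}
        # Drop tuples that are already known so they are not joined again.
        delta = joined - closure
        closure |= delta

    return closure


if __name__ == "__main__":
    print(sorted(transitive_closure([(1, 2), (2, 3), (3, 4)])))
```

A naive loop would join the entire `closure` with the edges on every pass, regenerating every tuple found earlier. Restricting each join to `delta` is one common way to avoid that repeated work, and the loop stops as soon as an iteration yields no new tuples, so cyclic inputs also terminate.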
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Phan, TC., Phan, AC., Tran, TTQ., Trieu, NT. (2020). Efficient Processing of Recursive Joins on Large-Scale Datasets in Spark. In: Le Thi, H., Le, H., Pham Dinh, T., Nguyen, N. (eds) Advanced Computational Methods for Knowledge Engineering. ICCSAMA 2019. Advances in Intelligent Systems and Computing, vol 1121. Springer, Cham. https://doi.org/10.1007/978-3-030-38364-0_35
DOI: https://doi.org/10.1007/978-3-030-38364-0_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38363-3
Online ISBN: 978-3-030-38364-0
eBook Packages: Intelligent Technologies and Robotics (R0)