An Efficient Two-Table Join Query Processing Based on Extended Bloom Filter in MapReduce

  • Junlu Wang
  • Jun Pang
  • Xiaoyan Li
  • Baishuo Han
  • Lei Huang
  • Linlin Ding (Email author)
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9998)


With the development of cloud computing, the Internet of Things, and similar technologies, large amounts of data are being produced. MapReduce, a processing architecture for cloud computing, is widely used for large-scale data processing. However, when two tables are joined under the MapReduce model, a great deal of data that does not satisfy the join condition is still transferred from the map side to the reduce side. This adds time overhead and I/O cost in the shuffle stage and lowers efficiency. Improving join query processing on MapReduce has therefore become an urgent problem. In this paper, we propose two-table join query processing and optimization strategies to address these problems. The optimized method extends the Bloom Filter, reduces the time spent in the shuffle phase, and improves the efficiency of the system.
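The general idea of filtering non-matching records before the shuffle can be illustrated with a minimal Python sketch. It uses a plain Bloom filter and an in-memory stand-in for the map and reduce sides; it is not the extended Bloom Filter proposed in the paper, whose details the abstract does not give. The names `BloomFilter`, `filtered_join`, `num_bits`, and `num_hashes` are hypothetical and chosen only for this illustration.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter (illustrative; not the paper's extended variant)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def filtered_join(small, large, num_bits=1024, num_hashes=3):
    """Join two lists of (key, value) records, pre-filtering `large`.

    Returns the joined rows and the number of `large` records that
    survived the filter (a stand-in for the shuffled data volume).
    """
    bloom = BloomFilter(num_bits, num_hashes)
    for key, _ in small:
        bloom.add(key)

    # "Map" side: drop records whose join key cannot match.
    shuffled = [(k, v) for k, v in large if bloom.might_contain(k)]

    # "Reduce" side: exact join, which also discards false positives.
    index = {}
    for k, v in small:
        index.setdefault(k, []).append(v)
    joined = [(k, sv, lv) for k, lv in shuffled for sv in index.get(k, [])]
    return joined, len(shuffled)


if __name__ == "__main__":
    small = [(1, "a"), (2, "b")]
    large = [(i, f"x{i}") for i in range(100)]
    rows, shuffled_count = filtered_join(small, large)
    print(rows, shuffled_count)
```

Because a Bloom filter never produces false negatives, every matching record survives the filter; false positives only cost some extra shuffled records and are removed by the exact join on the reduce side.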


Keywords: MapReduce; Bloom Filter; Join query processing and optimization



This work is supported by the National Natural Science Foundation of China under Grant Nos. 61472169 and 61502215; the Science Research Normal Fund of Liaoning Province Education Department (L2015193); the Doctoral Scientific Research Start Foundation of Liaoning Province (201501127); and the Young Research Foundation of Liaoning University under Grant No. LDQN201438.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Junlu Wang (1)
  • Jun Pang (2)
  • Xiaoyan Li (1)
  • Baishuo Han (1)
  • Lei Huang (1)
  • Linlin Ding (1, Email author)

  1. School of Information, Liaoning University, Shenyang, China
  2. School of Information Science and Engineering, Northeastern University, Shenyang, China
