A Distributed Load Balance Algorithm of MapReduce for Data Quality Detection

  • Yitong Gao
  • Yan Zhang
  • Hongzhi Wang
  • Jianzhong Li
  • Hong Gao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9645)


Big data quality detection is an important problem in the data quality field. MapReduce is a widely used distributed processing model for big data. Load balance is a key factor that influences the performance of MapReduce. In this paper, we propose a distributed greedy approximation algorithm for the load balance problem in MapReduce for data quality detection. There are three key challenges: (a) showing that the problem is NP-complete and proving a meaningful approximation ratio for the proposed algorithm, (b) requiring only one more round of MapReduce than conventional processing while taking a minimal share of the total processing time, and (c) remaining simple and easy to implement. Experimental results on real-life and synthetic data demonstrate that the proposed algorithm is effective for load balance.
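The abstract does not spell out the algorithm itself, so the following is only a minimal, single-machine sketch of the general idea behind greedy load balancing, not the authors' distributed method. It assumes the per-key workloads are already known (for example, from an extra counting round) and applies the classic longest-processing-time-first rule: process keys in decreasing order of load and give each one to the currently least-loaded reducer. The names `greedy_assign`, `key_loads`, and `num_reducers` are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def greedy_assign(key_loads: Dict[str, int],
                  num_reducers: int) -> Tuple[Dict[str, int], List[int]]:
    """Assign keys to reducers with the longest-processing-time-first rule.

    key_loads maps each key to an assumed, pre-computed workload estimate.
    Returns (key -> reducer index, total load per reducer).
    """
    if num_reducers <= 0:
        raise ValueError("num_reducers must be positive")

    # Min-heap of (current total load, reducer index).
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)

    assignment: Dict[str, int] = {}
    totals = [0] * num_reducers

    # Heaviest keys first; ties broken by key name for determinism.
    ordered = sorted(key_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for key, load in ordered:
        total, reducer = heapq.heappop(heap)  # least-loaded reducer
        assignment[key] = reducer
        totals[reducer] = total + load
        heapq.heappush(heap, (totals[reducer], reducer))

    return assignment, totals


if __name__ == "__main__":
    loads = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}
    assignment, totals = greedy_assign(loads, 2)
    print(assignment)
    print(totals)
```

For makespan scheduling on identical machines, this rule is known to stay within a factor of 4/3 − 1/(3m) of the optimal maximum load for m machines, which illustrates the kind of approximation guarantee the abstract refers to. The ratio proved in the paper may differ.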


Load balance; MapReduce; Data quality detection; Distributed approximation greedy algorithm



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Yitong Gao (1)
  • Yan Zhang (1)
  • Hongzhi Wang (1)
  • Jianzhong Li (1)
  • Hong Gao (1)
  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
