An Efficient Strategy of Building Distributed Index Based on Lucene

  • Tiangang Zhu
  • Yuanchun Zhou
  • Yang Zhang
  • Zhenghua Xue
  • Jiwu Bai
  • Jianhui Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7901)

Abstract

With the arrival of the big data era, the growing scale of available data poses a great challenge to industry and academia, and efficient query and retrieval over large amounts of data is becoming increasingly necessary. In this paper, we propose an efficient and smooth strategy for building a distributed index over large amounts of text. To improve memory usage and reduce manual intervention, the proposed strategy uses a dynamic threshold setting rather than a static threshold. The dynamic threshold also avoids Out of Memory (OOM) errors. For load balancing, we design a novel MinHeapPartition strategy to replace the default HashPartition. By continuously sending intermediate data to the reducer with the lowest load, MinHeapPartition ensures, as far as possible, that each reducer processes an approximately equal amount of data. To validate the efficiency and scalability of the proposed strategy, we build a distributed index on the Apache Hadoop and Lucene open source frameworks. In our experiments, we successfully index up to 1.02 TB of text data. Experimental results show that our strategy achieves a 20% performance improvement.
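
The abstract gives no implementation details for MinHeapPartition, so the following is only a minimal sketch of the general idea it describes: each piece of intermediate data goes to the reducer with the lowest load so far, tracked with a min-heap. The function name `min_heap_partition`, the use of record sizes as the load measure, and the tie-breaking rule are all assumptions made for illustration. The sketch also ignores the usual MapReduce requirement that records sharing a key reach the same reducer, and it is written in Python rather than against the Hadoop API.

```python
import heapq


def min_heap_partition(record_sizes, num_reducers):
    """Greedily assign each record to the currently least-loaded reducer.

    record_sizes: iterable of non-negative sizes (assumed load measure).
    Returns (assignments, loads), where assignments[i] is the reducer
    chosen for record i and loads[r] is the total size sent to reducer r.
    """
    # Heap entries are (current_load, reducer_id); ties go to the lower id.
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)

    assignments = []
    for size in record_sizes:
        # The root of the min-heap is the reducer with the lowest load.
        load, reducer = heapq.heappop(heap)
        assignments.append(reducer)
        heapq.heappush(heap, (load + size, reducer))

    loads = [0] * num_reducers
    for load, reducer in heap:
        loads[reducer] = load
    return assignments, loads


if __name__ == "__main__":
    # One large record followed by five small ones, two reducers.
    assignments, loads = min_heap_partition([5, 1, 1, 1, 1, 1], 2)
    print(assignments)  # [0, 1, 1, 1, 1, 1]
    print(loads)        # [5, 5]
```

In this toy run the large record lands on reducer 0 and every later small record is steered to reducer 1 until the loads even out, whereas a hash-based assignment would not take the accumulated load into account.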

Keywords

MapReduce, Hadoop, Lucene, distributed index

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Tiangang Zhu (1, 4)
  • Yuanchun Zhou (1)
  • Yang Zhang (1)
  • Zhenghua Xue (2)
  • Jiwu Bai (3)
  • Jianhui Li (1)

  1. Computer Network Information Center, Chinese Academy of Sciences, China
  2. Chanjet Information Technology Co. Ltd., China
  3. Jiyuan Power Supply Company of Henan Electric Power Company, Henan, China
  4. University of Chinese Academy of Sciences, China
