An Efficient Strategy of Building Distributed Index Based on Lucene
With the arrival of the big data era, the growing scale of available data poses a great challenge to industry and academia. Efficient query and retrieval over large amounts of data are becoming increasingly necessary. In this paper, we propose an efficient and smooth strategy for building a distributed index over large amounts of text. To improve memory usage and reduce manual intervention, the proposed strategy uses a dynamic threshold setting rather than a static threshold. The dynamic threshold setting also avoids Out of Memory (OOM) issues. For load balancing, we also design a novel MinHeapPartition strategy to replace the default HashPartition. By continuously sending intermediate data to the reducer with the lowest load, the MinHeapPartition strategy helps ensure that each reducer processes an approximately equal data load. To validate the efficiency and scalability of the proposed strategy, we build a distributed index based on the open source Apache Hadoop and Lucene frameworks. In our experiments, we successfully index up to 1.02 TB of text data. Experimental results show that our strategy achieves a 20% performance improvement.
Keywords: MapReduce, Hadoop, Lucene, distributed index
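The load-balancing idea in the abstract, sending intermediate data to the reducer with the lowest current load, can be illustrated with a min-heap. The sketch below is not the authors' MinHeapPartition implementation and does not use the Hadoop API. It is a minimal stand-alone illustration that assumes the size of each key group is known in advance; the function name `min_heap_partition` and the `key_sizes` input are hypothetical. Sorting the key groups from largest to smallest before assignment is an added heuristic, not something stated in the abstract.

```python
import heapq
from typing import Dict, List, Tuple


def min_heap_partition(
    key_sizes: Dict[str, int], num_reducers: int
) -> Tuple[Dict[str, int], List[int]]:
    """Greedily assign each key group to the currently least-loaded reducer.

    key_sizes maps a key to the (assumed known) size of its intermediate data.
    Returns the key -> reducer assignment and the final load of each reducer.
    """
    if num_reducers <= 0:
        raise ValueError("num_reducers must be positive")

    # Heap entries are (current_load, reducer_id); the lowest load is on top.
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)

    assignment: Dict[str, int] = {}
    loads = [0] * num_reducers

    # Assumption: placing large key groups first tends to balance better.
    # Ties on size are broken by key so the result is deterministic.
    for key, size in sorted(key_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        loads[reducer] = load + size
        heapq.heappush(heap, (loads[reducer], reducer))

    return assignment, loads


if __name__ == "__main__":
    sizes = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
    assignment, loads = min_heap_partition(sizes, num_reducers=2)
    print(assignment)
    print(loads)
```

Each key is assigned exactly once, so all records with the same key still reach the same reducer, which a MapReduce partitioner has to guarantee. A hash-based partitioner gives that guarantee without considering data volume, whereas this greedy scheme uses the per-reducer load to decide placement.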