
Frontiers of Computer Science, Volume 13, Issue 2, pp 343–356

Optimizing partitioning strategies for faster inverted index compression

  • Xingshen Song (email author)
  • Yuexiang Yang
  • Yu Jiang
  • Kun Jiang
Research Article

Abstract

The inverted index is a key component that lets search engines manage billions of documents and respond quickly to users’ queries. Whereas substantial effort has been devoted to reducing space occupancy and improving decoding speed, the encoding speed when constructing the index has been overlooked. Partitioning the index in alignment with its clustered distribution can effectively minimize the compressed size while accelerating its construction. In this study, we introduce compression speed as a criterion for evaluating compression techniques and thoroughly analyze the performance of different partitioning strategies. We also propose optimizations that give state-of-the-art methods faster compression and more flexibility in partitioning an index. Experiments show that our methods offer much better compression speed while retaining excellent space occupancy and decompression speed.
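
To make the idea of partitioning along a clustered distribution concrete, the following is a minimal sketch and not the method proposed in the article. It assumes a simple cost model in which each block of d-gaps is stored with one fixed bit width plus a per-block header; the names `HEADER_BITS`, `fixed_cost`, and `optimal_partition`, and the value of the header size, are hypothetical choices made for illustration. A dynamic program then picks block boundaries that minimize the total bit count, and a fixed-size split is shown for comparison.

```python
from typing import List, Tuple

HEADER_BITS = 16  # assumed per-block overhead in bits (hypothetical value)


def fixed_cost(gaps: List[int], block: int) -> int:
    """Bits used when the list is cut into fixed-size blocks."""
    total = 0
    for start in range(0, len(gaps), block):
        chunk = gaps[start:start + block]
        width = max(max(chunk).bit_length(), 1)  # widest gap sets the width
        total += HEADER_BITS + width * len(chunk)
    return total


def optimal_partition(gaps: List[int], max_block: int = 64) -> Tuple[int, List[int]]:
    """Return (minimal bits, block end positions) under the assumed cost model."""
    n = len(gaps)
    best = [0] + [float("inf")] * n  # best[i]: cheapest encoding of gaps[:i]
    prev = [0] * (n + 1)             # prev[i]: start of the last block in that encoding
    for end in range(1, n + 1):
        width = 0
        # Grow the candidate last block leftwards, tracking its bit width.
        for start in range(end - 1, max(end - max_block, 0) - 1, -1):
            width = max(width, gaps[start].bit_length(), 1)
            cost = best[start] + HEADER_BITS + width * (end - start)
            if cost < best[end]:
                best[end] = cost
                prev[end] = start
    # Walk the back-pointers to recover the chosen boundaries.
    bounds = []
    pos = n
    while pos > 0:
        bounds.append(pos)
        pos = prev[pos]
    bounds.reverse()
    return best[n], bounds


if __name__ == "__main__":
    # A dense cluster of small gaps followed by a sparse run of large gaps.
    gaps = [1] * 8 + [1000] * 8
    print(fixed_cost(gaps, 16))     # one block covering the whole list
    print(optimal_partition(gaps))  # boundaries chosen by the dynamic program
```

Under this assumed model the dynamic program separates the dense cluster from the sparse run, so the small gaps are no longer stored at the width required by the large ones. The inner loop also hints at why exhaustive partitioning can be slow at construction time: its work grows with the list length times the maximum block size.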

Keywords

inverted index, index compression, optimal partition, approximation algorithm

Supplementary material

11704_2016_6252_MOESM1_ESM.ppt (320 KB)

Copyright information

© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Xingshen Song (1, email author)
  • Yuexiang Yang (1)
  • Yu Jiang (1)
  • Kun Jiang (2)
  1. College of Computer, National University of Defense Technology, Changsha, China
  2. School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China
