Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study

  • Joel Mackenzie
  • Antonio Mallia
  • Matthias Petri
  • J. Shane Culpepper
  • Torsten Suel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11437)

Abstract

Document reordering is an important but often overlooked preprocessing stage in index construction. Reordering document identifiers in graphs and inverted indexes has been shown to reduce storage costs and improve processing efficiency in the resulting indexes. However, surprisingly few document reordering algorithms are publicly available despite their importance. A new reordering algorithm derived from recursive graph bisection was recently proposed by Dhulipala et al., and shown to be highly effective and efficient when compared against other state-of-the-art reordering strategies. In this work, we present a reproducibility study of this new algorithm. We describe the implementation challenges encountered, and explore the performance characteristics of our clean-room reimplementation. We show that we are able to successfully reproduce the core results of the original paper, and show that the algorithm generalizes to other collections and indexing frameworks. Furthermore, we make our implementation publicly available to help promote further research in this space.

Keywords

Reordering, Compression, Efficiency, Reproducibility

Notes

Acknowledgments

This work was supported by the National Science Foundation (IIS-1718680), the Australian Research Council (DP170102231), and the Australian Government (RTP Scholarship).

References

  1. Arguello, J., Diaz, F., Lin, J., Trotman, A.: SIGIR 2015 workshop on reproducibility, inexplicability, and generalizability of results (RIGOR). In: Proceedings of SIGIR, pp. 1147–1148 (2015)
  2. Blanco, R., Barreiro, Á.: Document identifier reassignment through dimensionality reduction. In: Losada, D.E., Fernández-Luna, J.M. (eds.) ECIR 2005. LNCS, vol. 3408, pp. 375–387. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-31865-1_27
  3. Blanco, R., Barreiro, Á.: Characterization of a simple case of the reassignment of document identifiers as a pattern sequencing problem. In: Proceedings of SIGIR, pp. 587–588 (2005)
  4. Blanco, R., Barreiro, Á.: TSP and cluster-based solutions to the reassignment of document identifiers. Inf. Retr. 9(4), 499–517 (2006)
  5. Blandford, D., Blelloch, G.: Index compression through document reordering. In: Proceedings DCC 2002, Data Compression Conference, pp. 342–352 (2002)
  6. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. J. Comput. Syst. Sci. 60(3), 630–659 (2000)
  7. Chierichetti, F., Kumar, R., Lattanzi, S., Mitzenmacher, M., Panconesi, A., Raghavan, P.: On compressing social networks. In: Proceedings of SIGKDD, pp. 219–228 (2009)
  8. Crane, M., Culpepper, J.S., Lin, J., Mackenzie, J., Trotman, A.: A comparison of Document-at-a-Time and Score-at-a-Time query evaluation. In: Proceedings of WSDM, pp. 201–210 (2017)
  9. Dean, J.: Challenges in building large-scale information retrieval systems: invited talk. In: Proceedings of WSDM, pp. 1–1 (2009)
  10. Dhulipala, L., Kabiljo, I., Karrer, B., Ottaviano, G., Pupyrev, S., Shalita, A.: Compressing graphs and indexes with recursive graph bisection. In: Proceedings of SIGKDD, pp. 1535–1544 (2016)
  11. Ding, S., Suel, T.: Faster top-k document retrieval using block-max indexes. In: Proceedings of SIGIR, pp. 993–1002 (2011)
  12. Ding, S., Attenberg, J., Suel, T.: Scalable techniques for document identifier assignment in inverted indexes. In: Proceedings of the WWW, pp. 311–320 (2010)
  13. Fredriksson, K., Kilpeläinen, P.: Practically efficient array initialization. Soft. Prac. Exp. 46(4), 435–467 (2016)
  14. Hasibi, F., Balog, K., Bratsberg, S.E.: On the reproducibility of the TAGME entity linking system. In: Ferro, N., Crestani, F., Moens, M.-F., Mothe, J., Silvestri, F., Di Nunzio, G.M., Hauff, C., Silvello, G. (eds.) ECIR 2016. LNCS, vol. 9626, pp. 436–449. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30671-1_32
  15. Hawking, D., Jones, T.: Reordering an index to speed query processing without loss of effectiveness. In: Proceedings of ADCS, pp. 17–24 (2012)
  16. Kane, A., Tompa, F.W.: Split-lists and initial thresholds for WAND-based search. In: Proceedings of SIGIR, pp. 877–880 (2018)
  17. Lemire, D., Kurz, N., Rupp, C.: Stream vbyte: faster byte-oriented integer compression. Inf. Proc. Lett. 130, 1–6 (2018)
  18. Mallia, A., Ottaviano, G., Porciani, E., Tonellotto, N., Venturini, R.: Faster BlockMax WAND with variable-sized blocks. In: Proceedings of SIGIR, pp. 625–634 (2017)
  19. Moffat, A., Stuiver, L.: Binary interpolative coding for effective index compression. Inf. Retr. 3(1), 25–47 (2000)
  20. Ottaviano, G., Venturini, R.: Partitioned Elias-Fano indexes. In: Proceedings of SIGIR, pp. 273–282 (2014)
  21. Richardson, M., Prakash, A., Brill, E.: Beyond pagerank: machine learning for static ranking. In: Proceedings of WWW, pp. 707–715 (2006)
  22. Shieh, W.-Y., Chen, T.-F., Shann, J.J.-J., Chung, C.-P.: Inverted file compression through document identifier reassignment. Inf. Proc. Man. 39(1), 117–131 (2003)
  23. Silvestri, F.: Sorting out the document identifier assignment problem. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 101–112. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-71496-5_12
  24. Yan, H., Ding, S., Suel, T.: Inverted index compression and query processing with optimized document ordering. In: Proceedings of WWW, pp. 401–410 (2009)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Joel Mackenzie (1)
  • Antonio Mallia (2)
  • Matthias Petri (3)
  • J. Shane Culpepper (1)
  • Torsten Suel (2)

  1. RMIT University, Melbourne, Australia
  2. New York University, New York, USA
  3. The University of Melbourne, Melbourne, Australia
