Definitions
The data structure at the core of today's large-scale search engines, social networks, and storage architectures is the inverted index. Given a collection of documents, consider, for each distinct term t appearing in the collection, the integer sequence ℓt that lists in sorted order the identifiers of all documents (docIDs in the following) in which the term appears. The sequence ℓt is called the inverted list or posting list of the term t. The inverted index is the collection of all such lists.
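To make the definition concrete, the following is a minimal sketch of how the posting lists could be built for a toy collection. It assumes that docIDs are simply the positions of the documents in the input list and that terms are obtained by lowercasing and splitting on whitespace; the function name `build_inverted_index` and the sample documents are illustrative and do not come from the entry.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each distinct term to the sorted list of docIDs that contain it.

    Assumption: the docID of a document is its position in `docs`, and
    terms are produced by lowercasing and whitespace splitting.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() ensures each docID is appended at most once per term.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    # docIDs are visited in increasing order, so every list is already sorted.
    return dict(index)


docs = ["the quick brown fox", "the lazy dog", "quick brown dog"]
index = build_inverted_index(docs)
print(index["quick"])  # [0, 2]
print(index["dog"])    # [1, 2]
```

Because the documents are scanned in increasing docID order, each posting list comes out strictly increasing without an explicit sort, which matches the sorted-order requirement of the definition.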
This entry surveys the most important encoding algorithms developed for efficient inverted index compression and fast retrieval.
Overview
The inverted index owes its popularity to its efficient resolution of queries, each expressed as a set of terms {t1, …, tk} combined with a query operator. The simplest operators are Boolean AND and OR. For example, given an AND query, the index has to report the docIDs of all documents that contain every one of the terms {t1, …, tk}....
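As an illustration of the two Boolean operators, here is a minimal sketch over uncompressed posting lists. It assumes the index is a plain dictionary from terms to sorted docID lists; the names `intersect`, `and_query`, and `or_query` and the toy index are hypothetical, and the linear merge shown is only one of several possible intersection strategies.

```python
def intersect(a, b):
    """Intersect two sorted docID lists with a linear two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(index, terms):
    """Return the docIDs of documents containing every term in `terms`."""
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
        if not result:
            break  # no document can match any more
    return result


def or_query(index, terms):
    """Return the docIDs of documents containing at least one term."""
    return sorted(set().union(*(index.get(t, []) for t in terms)))


# Toy index (illustrative data only).
toy_index = {"the": [0, 1], "quick": [0, 2], "brown": [0, 2], "dog": [1, 2]}
print(and_query(toy_index, ["quick", "dog"]))  # [2]
print(or_query(toy_index, ["the", "dog"]))     # [0, 1, 2]
```

Both operators rely on the posting lists being sorted: the AND merge advances whichever pointer holds the smaller docID, so each list is scanned at most once.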
Copyright information
© 2019 Springer Nature Switzerland AG
About this entry
Cite this entry
Pibiri, G.E., Venturini, R. (2019). Inverted Index Compression. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_52
DOI: https://doi.org/10.1007/978-3-319-77525-8_52
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science; Reference Module Computer Science and Engineering