An inverted file is an index data structure that maps content to its location within a database file, a document, or a set of documents. It is normally composed of (i) a vocabulary that contains all the distinct words found in a text and (ii) for each word t of the vocabulary, a list that contains statistics about the occurrences of t in the text. This list is known as the inverted list of t. The inverted file is the most popular data structure used in document retrieval systems to support full-text search.
Efforts to index electronic texts have appeared in the literature since the early days of computing systems. For example, descriptions of Electronic Information Search Systems that are able to index and search text can be found in the early 1950s.
In a seminal book published in 1968, Gerard Salton laid out the basis for modern information retrieval systems [2], including a description of a model for indexing texts that is still widely adopted, known as the Vector Space Model. The inverted file was adopted as the index structure for implementing the Vector Space Model, and it has been widely applied and studied since then.
Inverted files allow fast lookup of statistics related to the distinct words found in a text. They are designed to use words as the search unit, which restricts their use in applications where words are not clearly defined or where the system does not use words as the search unit.
The statistics stored in an inverted file may vary according to the target application. Two alternatives commonly found in the literature are to record the position of every word occurrence in a given text and to record general statistics about the word occurrences across text units. Indexing all the word occurrences is useful in applications where positional information must be taken into account, such as when phrase or proximity queries have to be supported. Inverted files that store word statistics are usually deployed in systems that adopt information retrieval models, such as the Vector Space Model. In this case the text is divided into units of information, usually documents. For instance, in web search engines these units are the pages crawled from the web, while the whole set of pages composes the indexed text.
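The two alternatives can be illustrated with a minimal Python sketch. It assumes that words are lowercase, whitespace-separated tokens and that documents are numbered from 0; the function names `tokenize`, `build_positional_index`, and `build_frequency_index` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: words are lowercase tokens separated by whitespace.
    return text.lower().split()


def build_positional_index(docs):
    """Map each word to {doc_id: [positions]}, recording every occurrence."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)


def build_frequency_index(docs):
    """Map each word to a list of (doc_id, term_frequency) pairs."""
    positional = build_positional_index(docs)
    return {
        word: [(doc_id, len(positions)) for doc_id, positions in sorted(entry.items())]
        for word, entry in positional.items()
    }


docs = ["the cat sat on the mat", "the dog sat"]
positional = build_positional_index(docs)
frequency = build_frequency_index(docs)
```

Here `positional["the"]` holds the word positions per document, which is what phrase and proximity queries need, while `frequency["the"]` keeps only per-document counts, which is sufficient for ranking models based on term statistics.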
Querying a word in an inverted index consists of first locating the word in the vocabulary and obtaining the position of its inverted list. This operation can be performed in O(1) expected time by using a hash table. The inverted list of the word is then accessed to provide the search results. The vocabulary usually requires sub-linear space when compared to the size of the inverted lists, which usually makes it far smaller than these lists.
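The lookup described above can be sketched as follows. This is a simplified illustration, not a production layout: a Python `dict` plays the role of the hashed vocabulary, a single flat list stands in for the file of inverted lists, and the names `build_index` and `lookup` are hypothetical.

```python
def build_index(docs):
    """Return (vocabulary, postings): the vocabulary maps a word to the
    (offset, length) of its inverted list inside the flat postings list."""
    lists = {}
    for doc_id, text in enumerate(docs):
        for word in set(text.lower().split()):
            lists.setdefault(word, []).append(doc_id)
    vocabulary, postings = {}, []
    for word in sorted(lists):
        vocabulary[word] = (len(postings), len(lists[word]))
        postings.extend(lists[word])
    return vocabulary, postings


def lookup(vocabulary, postings, word):
    # Step 1: hash lookup in the vocabulary (expected O(1)).
    entry = vocabulary.get(word.lower())
    if entry is None:
        return []
    # Step 2: read the inverted list at the recorded position.
    offset, length = entry
    return postings[offset:offset + length]


docs = ["new home sales", "home prices rise", "new car sales"]
vocabulary, postings = build_index(docs)
home_docs = lookup(vocabulary, postings, "home")  # documents 0 and 1
```

Keeping only an offset and a length per word is what lets the vocabulary stay small while the inverted lists themselves live elsewhere, for example on disk.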
Word occurrences are stored in inverted files for applications where positional information must be taken into account, such as when phrase search or proximity queries have to be supported. To search for a phrase or proximity pattern (where the words must appear consecutively or close to each other, respectively), each word is searched separately. The resulting lists of occurrences are then intersected, taking into account the consecutiveness or closeness of the word positions in the text. The cost of such queries can be reduced by adopting an auxiliary data structure, known as the next word index, which includes information about the next word in the positional inverted list entries. This alternative can significantly reduce the processing time of phrase queries. Another choice would be to index pairs of consecutive words, but the vocabulary would then be much larger, which would make this option infeasible.
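A minimal sketch of the phrase search procedure follows, under the assumption that the positional index maps each word to a dictionary from document number to a list of word positions. It shows only the basic intersection of positional lists, not the next word index; the function names are hypothetical.

```python
def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} (whitespace tokenization assumed)."""
    index = {}
    for doc_id, text in enumerate(docs):
        for position, word in enumerate(text.lower().split()):
            index.setdefault(word, {}).setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return {doc_id: [start positions]} where the words occur consecutively."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return {}
    # Candidate documents must contain every word of the phrase.
    candidates = set(index[words[0]])
    for word in words[1:]:
        candidates &= set(index[word])
    matches = {}
    for doc_id in sorted(candidates):
        # A start position s matches if word i occurs at position s + i.
        starts = set(index[words[0]][doc_id])
        for i, word in enumerate(words[1:], start=1):
            starts &= {position - i for position in index[word][doc_id]}
        if starts:
            matches[doc_id] = sorted(starts)
    return matches


docs = ["to be or not to be", "not to be outdone", "be to"]
index = build_positional_index(docs)
hits = phrase_search(index, "to be")
```

The third document contains both words but in the wrong order, so it is a candidate after the document-level intersection and is discarded only by the position check.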
Building an Inverted File
The texts indexed nowadays by search systems are usually too large to allow inverted files to be created entirely in main memory. Disk-based algorithms for generating compressed inverted files have been extensively studied in the literature. One example is the multiway merging algorithm.
The sequential algorithm shown above uses two passes to read and parse the documents in the collection. This allows building a perfect hashed vocabulary, which provides direct access to any inverted list with no need to look up a vocabulary entry. Thus, once the perfect hash has been built, it is no longer necessary to keep the vocabulary in memory (all significant memory consumption is then due to the buffer B, which stores the inverted lists).
In cases where the text is too large, as happens in search engines that try to index the whole web, distributed algorithms should be adopted for building inverted files. The current best alternative for building distributed inverted files is also the simplest one. It partitions the text into small sub-collections, each small enough to fit on a single machine. A local index is built for each sub-collection; when queries arrive, they are submitted to every sub-collection and evaluated against every local index. A final step merges the answers produced by the individual machines, yielding a single ranking of results for the end users.
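The partitioning scheme above can be sketched in a single process, with each `LocalIndex` object standing in for one machine. The scoring function (sum of raw term frequencies), the round-robin assignment of documents, and all names are assumptions made for this illustration only.

```python
from collections import Counter


def rank(hits, k):
    """Sort (doc_id, score) pairs by descending score, then by doc_id."""
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))[:k]


class LocalIndex:
    """Index over one sub-collection; stands in for a single machine."""

    def __init__(self, docs):
        self.postings = {}
        for doc_id, text in sorted(docs.items()):
            for word, tf in Counter(text.lower().split()).items():
                self.postings.setdefault(word, []).append((doc_id, tf))

    def search(self, query, k):
        # Toy score: sum of the term frequencies of the query words.
        scores = Counter()
        for word in query.lower().split():
            for doc_id, tf in self.postings.get(word, []):
                scores[doc_id] += tf
        return rank(scores.items(), k)


def partition(docs, n):
    """Assign documents to n sub-collections by doc_id modulo n."""
    parts = [{} for _ in range(n)]
    for doc_id, text in docs.items():
        parts[doc_id % n][doc_id] = text
    return parts


def distributed_search(indexes, query, k):
    # Broadcast the query, collect each local top-k, then merge.
    partial = [hit for index in indexes for hit in index.search(query, k)]
    return rank(partial, k)


docs = {0: "inverted file index", 1: "index compression",
        2: "file system", 3: "inverted index index"}
indexes = [LocalIndex(part) for part in partition(docs, 2)]
top = distributed_search(indexes, "inverted index", 2)
```

Because this toy score depends only on the document itself, merging the local top-k lists gives the same ranking as a single global index. Ranking functions that rely on collection-wide statistics would additionally need those statistics to be shared among the machines.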
A technique to reduce the space requirements of inverted files is to compress the index. The key idea is that the inverted list entries of each word can be sorted in increasing order, so the gaps between consecutive positions can be stored instead of the absolute values. Compression techniques for small integers can then be applied. As the gaps are smaller for longer lists, longer lists compress better. Previous work has shown that inverted files can be reduced to as little as 10% of their original size without degrading performance, and performance may even improve because of reduced I/O.
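As an illustration of gap-based compression, the sketch below converts a sorted inverted list into gaps and encodes them with a variable-byte code, one of several codes for small integers. The choice of variable-byte coding, the byte layout (7 data bits per byte, high bit set on the last byte of each number), and the function names are assumptions of this example.

```python
def to_gaps(sorted_ids):
    """Replace each value by its difference from the previous one."""
    gaps, previous = [], 0
    for value in sorted_ids:
        gaps.append(value - previous)
        previous = value
    return gaps


def from_gaps(gaps):
    """Rebuild the absolute values by a running sum of the gaps."""
    values, total = [], 0
    for gap in gaps:
        total += gap
        values.append(total)
    return values


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order bits first."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)  # high bit marks the last byte of a number
    return bytes(out)


def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


postings = [3, 7, 8, 150, 100000]
encoded = vbyte_encode(to_gaps(postings))
decoded = from_gaps(vbyte_decode(encoded))
```

For this list the gaps are 3, 4, 1, 142 and 99850, which take 8 bytes in total, compared with 20 bytes if each entry were stored as a fixed 32-bit integer.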
Another alternative for reducing the space requirements of an inverted file, and the query processing cost associated with accessing the index, is to minimize the number of indexed entries by applying static pruning methods. Pruning methods try to avoid processing index entries without causing a loss of quality in the final results produced by the search system. They can be classified as dynamic or static. Dynamic methods keep the complete index stored on disk and use heuristics to avoid reading unnecessary information at query processing time. In this case, the amount of pruning performed varies according to the user queries, which is an advantage, since the methods can adapt to each specific query. In contrast, static methods try to predict, at index construction time, which entries will not be useful at query processing time. These entries are then removed from the index. For this reason, static methods can be seen as lossy compression methods. Static methods offer the advantage of reducing both the disk storage cost and the time to process each query. A system that uses both static and dynamic methods can also be implemented to take advantage of the two types of pruning.
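To make the idea of static pruning concrete, the sketch below applies a deliberately simple heuristic: for each word, only the `keep` entries with the highest term frequency are retained. This heuristic and the name `static_prune` are assumptions chosen for illustration; they do not reproduce any specific method from the literature, such as the locality based method of [3].

```python
def static_prune(index, keep):
    """Keep, for each word, the `keep` entries with the highest frequency.

    `index` maps a word to a list of (doc_id, term_frequency) pairs.
    The pruned lists are returned sorted by doc_id again.
    """
    pruned = {}
    for word, postings in index.items():
        best = sorted(postings, key=lambda entry: (-entry[1], entry[0]))[:keep]
        pruned[word] = sorted(best)
    return pruned


index = {
    "index": [(0, 1), (1, 5), (2, 2), (3, 1)],
    "file": [(0, 3), (2, 1)],
}
pruned = static_prune(index, keep=2)
```

The pruning is lossy: documents 0 and 3 can no longer be retrieved through the word "index", which is acceptable only if such low-scoring entries would rarely reach the final ranking.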
In applications where the indexed text changes over time, with portions being removed, added, or changed, it is necessary to reflect such changes in the inverted file. The simplest approach is to rebuild the whole index, which may be acceptable if the index can be updated offline and the indexing time is small. If these conditions do not hold, more sophisticated strategies should be adopted. Several index maintenance strategies can be found in the literature. They can be divided into three categories, with index rebuilding being the first, obvious choice. The second category is the intermittent merge, where small indexes that register updates are stored in main memory, which makes updates inexpensive. In this case, the temporary main memory index and the disk index must be merged at query processing time. A real update should be performed periodically to avoid a memory overflow in the temporary index. The third category is the incremental update, which updates the main index term by term using a process similar to the mechanisms used for maintaining variable-length records in conventional database management systems.
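The intermittent merge strategy can be sketched as follows. In this simplified illustration a dictionary stands in for the disk index, only document additions are handled, and the class name, the `flush_threshold` parameter, and the counting of pending entries are hypothetical choices.

```python
class UpdatableIndex:
    """Main index plus a small in-memory index that absorbs updates."""

    def __init__(self, flush_threshold=1000):
        self.disk = {}     # stands in for the large on-disk index
        self.memory = {}   # temporary index that registers recent updates
        self.pending = 0   # number of entries waiting in memory
        self.flush_threshold = flush_threshold

    def add_document(self, doc_id, text):
        # Updates touch only the in-memory index, so they are inexpensive.
        for word in set(text.lower().split()):
            self.memory.setdefault(word, []).append(doc_id)
            self.pending += 1
        if self.pending >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Periodic real update: fold the temporary index into the main one."""
        for word, doc_ids in self.memory.items():
            merged = set(self.disk.get(word, [])) | set(doc_ids)
            self.disk[word] = sorted(merged)
        self.memory.clear()
        self.pending = 0

    def search(self, word):
        # At query time both indexes are consulted and their lists merged.
        word = word.lower()
        merged = set(self.disk.get(word, [])) | set(self.memory.get(word, []))
        return sorted(merged)


index = UpdatableIndex(flush_threshold=4)
index.add_document(0, "inverted file")
index.add_document(1, "file update")   # reaches the threshold and flushes
index.add_document(2, "file merge")    # stays in the temporary index
file_docs = index.search("file")
```

Handling removals or changes would require extra bookkeeping, for example a list of deleted document numbers that is filtered out at query time and applied during the periodic merge.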
Query processing over inverted files can be performed in two distinct forms, based on either term order or document order. In term order processing, also called term-at-a-time (TAAT), each inverted list is processed completely before the next one is read. The partial list of results obtained after processing each term list is stored in memory, which means this method may require additional memory when processing queries. The second form is document order processing, also called document-at-a-time (DAAT), where whenever information about a document is found in one of the lists, all information about this document is immediately read from the remaining inverted lists of the terms present in the query. The document ordering method requires the inverted lists to be stored sorted by document number, or by occurrence when the index stores all term occurrences. Several authors have proposed methods for efficiently processing queries using TAAT strategies [5, 6, 7], DAAT strategies [8, 9], or hybrid approaches combining both strategies [10]. The choice of the best algorithm and strategy depends on parameters such as the size of the collection, the expected number of terms per query, and the ranking model adopted. For instance, TAAT strategies or hybrid approaches perform better in applications with more terms per query.
Previous work indicates that the term ordering method results in faster query evaluation. However, for short queries, which are common in many search applications, this difference becomes smaller. A combination of document order and term order may also be implemented by processing the inverted lists in blocks.
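The difference between the two processing orders can be seen in the following sketch. It assumes that each inverted list holds (document number, weight) pairs sorted by document number and that a document's score is the sum of the weights of the query terms; the functions `taat` and `daat` are simplified illustrations without any pruning or skipping.

```python
import heapq
from collections import defaultdict


def taat(index, terms, k):
    """Term-at-a-time: process each list fully, accumulating partial scores."""
    accumulators = defaultdict(float)  # one entry per candidate document
    for term in terms:
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    ranked = sorted(accumulators.items(), key=lambda x: (-x[1], x[0]))
    return ranked[:k]


def daat(index, terms, k):
    """Document-at-a-time: advance all lists in parallel by document number."""
    if k <= 0:
        return []
    lists = [index[term] for term in terms if term in index]
    cursors = [0] * len(lists)
    heap = []  # min-heap holding at most k (score, -doc_id) items
    while True:
        heads = [lst[c][0] for lst, c in zip(lists, cursors) if c < len(lst)]
        if not heads:
            break
        doc_id = min(heads)  # next document in document order
        score = 0.0
        for i, lst in enumerate(lists):
            c = cursors[i]
            if c < len(lst) and lst[c][0] == doc_id:
                score += lst[c][1]
                cursors[i] += 1
        item = (score, -doc_id)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return [(-neg_id, score) for score, neg_id in sorted(heap, reverse=True)]


index = {
    "inverted": [(0, 1.0), (3, 2.0)],
    "index": [(0, 0.5), (1, 1.5), (3, 0.5)],
}
taat_top = taat(index, ["inverted", "index"], 2)
daat_top = daat(index, ["inverted", "index"], 2)
```

Both functions return the same ranking. The TAAT version keeps one accumulator per candidate document, whereas the DAAT version knows the final score of a document as soon as it has been visited and therefore needs to keep only the current best k results.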
A final comment about query processing is that it can be sped up by using caching strategies. At least three distinct cache layers have been proposed in the literature. First, the system can adopt a cache of inverted lists to keep the most frequently accessed portions of the lists in memory. Second, a cache of results can be used, which stores the final results provided to the users. Finally, a projection cache containing frequent intersections of lists can also be adopted. Previous work concludes that these caching techniques can significantly increase the maximum query processing capacity of a system [11].
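Of the three layers, the cache of results is the simplest to illustrate. The sketch below uses a least-recently-used replacement policy, which is an assumption of this example rather than a policy prescribed by the text; the class name `ResultCache` and the `evaluate` callback are likewise hypothetical.

```python
from collections import OrderedDict


class ResultCache:
    """Least-recently-used cache of final query results."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query, evaluate):
        # Normalize case and whitespace so trivially different queries match.
        key = " ".join(query.lower().split())
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        results = evaluate(query)  # fall back to the inverted file
        self.entries[key] = results
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return results


calls = []


def evaluate(query):
    # Hypothetical stand-in for full query processing over the index.
    calls.append(query)
    return [len(query)]


cache = ResultCache(capacity=2)
cache.get("inverted index", evaluate)
cache.get("Inverted  Index", evaluate)  # served from the cache
```

A cache of inverted lists or a projection cache could follow the same pattern, with the word or the pair of words as the key and the list or the intersection as the stored value.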
Inverted files are by far the most widely used indexing structures in text search systems. Such indexes are used, for instance, in the popular large-scale web search engines.
- 2. Salton G. Automatic information organization and retrieval. New York: McGraw-Hill; 1968.
- 3. de Moura ES, dos Santos CF, Fernandes DR, Silva AS, Calado P, Nascimento MA. Improving web search efficiency via a locality based static pruning method. Proceedings of 14th International World Wide Web Conference; 2005. p. 235–44.
- 4. Baeza-Yates R, Ribeiro-Neto B. Modern information retrieval. 2nd ed. Reading: Addison Wesley; 2011.
- 5. Anh V, Moffat A. Pruned query evaluation using pre-computed impacts. ACM SIGIR; 2006. p. 372–9.
- 6. Anh V, Kretser O de, Moffat A. Vector-space ranking with effective early termination. ACM SIGIR; 2001. p. 35–42.
- 7. Strohman T, Croft WB. Efficient document retrieval in main memory. ACM SIGIR; 2007. p. 175–82.
- 8. Ding S, Suel T. Faster top-k document retrieval using block-max indexes. ACM SIGIR; 2011. p. 993–1002.
- 9. Rossi C, de Moura ES, Carvalho AL, da Silva AS. Fast document-at-a-time query processing using two-tier indexes. ACM SIGIR; 2013. p. 183–92.
- 10. Fontoura M, Josifovski V, Liu J, Venkatesan S, Zhu X, Zien J. Evaluation strategies for top-k queries over memory-resident inverted indexes. PVLDB. 2011;4(12):1213–24.
- 11. Long X, Suel T. Three-level caching for efficient query processing in large Web search engines. Proceedings of 14th International World Wide Web Conference; 2005. p. 257–66.
- 12. Kaszkiel M, Zobel J. Term-ordered query evaluation versus document-ordered query evaluation for large document databases. Proceedings of 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 1998. p. 343–4.