Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

Indexing Compressed Text

Paolo Ferragina, Rossano Venturini
Compressed and searchable data format; Compressed full-text indexing; Compressed suffix array; Compressed suffix tree


Given a text T[1, n], the Compressed Text Indexing problem requires building an indexing data structure over T that takes space close to the empirical entropy of the input text and answers queries on the occurrences of an arbitrary pattern P[1, p] in T without any significant slowdown with respect to uncompressed indexes. There are three main queries: count(P), which returns the number of pattern occurrences in T; locate(P), which returns the starting positions of all pattern occurrences in T; and extract(i, j), which retrieves the substring T[i, j].
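To make the semantics of the three queries concrete, the following is a minimal sketch built on a plain, uncompressed suffix array. It is not a compressed index and makes no attempt to approach entropy-bounded space; it only fixes what count, locate, and extract are expected to return. The class name NaiveTextIndex and the helper _range are hypothetical names introduced for illustration, and positions are 1-based to match the notation T[1, n].

```python
from bisect import bisect_left, bisect_right


class NaiveTextIndex:
    """Uncompressed suffix-array index (illustrative only, not space-efficient)."""

    def __init__(self, text):
        self.text = text
        # Suffix array: 0-based starting offsets of all suffixes of the text,
        # listed in lexicographic order of the suffixes. Built naively here.
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])

    def _range(self, pattern):
        # Truncating every suffix to len(pattern) characters keeps the list
        # sorted, so the suffixes prefixed by the pattern form one contiguous
        # range that two binary searches can delimit.
        prefixes = [self.text[i:i + len(pattern)] for i in self.sa]
        return bisect_left(prefixes, pattern), bisect_right(prefixes, pattern)

    def count(self, pattern):
        """Number of occurrences of the pattern in the text."""
        lo, hi = self._range(pattern)
        return hi - lo

    def locate(self, pattern):
        """Sorted 1-based starting positions of all occurrences."""
        lo, hi = self._range(pattern)
        return sorted(i + 1 for i in self.sa[lo:hi])

    def extract(self, i, j):
        """Substring T[i, j], with 1-based inclusive endpoints."""
        return self.text[i - 1:j]
```

For the text "mississippi", count("ssi") returns 2, locate("ssi") returns [3, 6], and extract(5, 8) returns "issi". A compressed index is expected to answer the same queries while storing neither the text nor the suffix array explicitly.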

Historical Background

String processing and searching tasks are at the core of modern web search, information retrieval (IR), database, and data mining applications. Most of the text manipulations required by these applications involve, sooner or later, searching those (long) texts for (short) patterns...

