Compressed String Dictionaries

  • Nieves R. Brisaboa
  • Rodrigo Cánovas
  • Francisco Claude
  • Miguel A. Martínez-Prieto
  • Gonzalo Navarro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6630)

Abstract

The problem of storing a set of strings (a string dictionary) in compact form arises naturally in many settings. Classically, the dictionary has represented a small part of the whole data to be processed (e.g., in natural language processing or in indexing text collections). Recent applications in Web engines, RDF graphs, bioinformatics, and many other areas, however, handle very large string dictionaries whose size is a significant fraction of the whole data, so efficient approaches to compressing them are necessary. In this paper we empirically compare the time and space performance of some existing alternatives, as well as new ones we propose. We show that the strings can be reduced to as little as 20% of their original size while supporting dictionary searches within a few microseconds, and to as little as 10% within a few tens or hundreds of microseconds.

Keywords

Hash Function, Hash Table, Binary Search, Wavelet Tree, Target Symbol
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Nieves R. Brisaboa (1)
  • Rodrigo Cánovas (2)
  • Francisco Claude (3)
  • Miguel A. Martínez-Prieto (2, 4)
  • Gonzalo Navarro (2)

  1. Database Lab, Universidade da Coruña, Spain
  2. Department of Computer Science, University of Chile, Chile
  3. School of Computer Science, University of Waterloo, Canada
  4. Department of Computer Science, Universidad de Valladolid, Spain