Enumerated Automata Implementation of String Dictionaries

  • Robert Bakarić
  • Damir Korenčić
  • Strahil Ristov
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11601)

Abstract

Over the last decade, considerable effort has been invested in research on implementing string dictionaries. A string dictionary is a data structure that bijectively maps a set of strings to a set of integers and is used in various index-based applications. A recent paper [18] can be regarded as a reference work on string dictionary implementations. Although very comprehensive, [18] does not cover the implementation of a string dictionary with an enumerated deterministic finite automaton, a data structure naturally suited to this purpose. We compare the results for the state-of-the-art compressed enumerated automaton with those presented in [18], both on the same collection of data sets and on a collection of natural language word lists. We show that our string dictionary implementation is competitive for different types of data, especially for large sets of strings and for strings that are highly similar to one another. In particular, our method stands out as a solution for storing DNA motifs and words of inflected natural languages. We provide the code used for the experiments.
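
To illustrate the idea of enumeration behind the bijective string-to-integer mapping, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes an uncompressed trie (a special case of an acyclic DFA) rather than a minimized or compressed automaton, and the names `Node`, `build`, `locate`, and `extract` are hypothetical. Each state stores the number of dictionary strings accepted from it, which is enough to compute the lexicographic rank of a string and to recover a string from its rank.

```python
class Node:
    __slots__ = ("edges", "final", "count")

    def __init__(self):
        self.edges = {}     # symbol -> child Node
        self.final = False  # True if a dictionary string ends in this state
        self.count = 0      # number of dictionary strings accepted from this state


def _annotate(node):
    # Enumeration step: count the strings accepted from each state.
    node.count = int(node.final) + sum(_annotate(c) for c in node.edges.values())
    return node.count


def build(words):
    # Assumption: a plain trie stands in for the compressed automaton.
    root = Node()
    for word in set(words):
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    _annotate(root)
    return root


def locate(root, word):
    """Return the 1-based lexicographic rank of word, or 0 if it is absent."""
    node, index = root, 0
    for ch in word:
        if ch not in node.edges:
            return 0
        if node.final:
            index += 1  # the proper prefix ending here sorts before word
        for sym in sorted(node.edges):
            if sym == ch:
                break
            index += node.edges[sym].count  # skip all strings in smaller branches
        node = node.edges[ch]
    return index + 1 if node.final else 0


def extract(root, index):
    """Return the string with the given 1-based rank, or None if out of range."""
    if not 1 <= index <= root.count:
        return None
    node, out = root, []
    while True:
        if node.final:
            if index == 1:
                return "".join(out)
            index -= 1
        for sym in sorted(node.edges):
            child = node.edges[sym]
            if index <= child.count:
                out.append(sym)
                node = child
                break
            index -= child.count
```

In this sketch `locate` and `extract` are inverses of each other on the stored set, so the mapping is a bijection between the strings and the integers 1..n. Sorting the outgoing edges on every step keeps the code short; a practical implementation would store transitions in sorted order.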

Keywords

String dictionary · Enumerated DFA · Recursive automaton · LZ trie · DNA indexing

Notes

Acknowledgment

We are grateful to Miguel Martínez-Prieto for kindly providing data sets used in [18].

References

  1. Arroyuelo, D., Cánovas, R., Navarro, G., Sadakane, K.: Succinct trees in practice. In: Blelloch, G.E., Halperin, D. (eds.) ALENEX 2010, pp. 84–97. SIAM, Philadelphia (2010). https://doi.org/10.1137/1.9781611972900.9
  2. Arz, J., Fischer, J.: LZ-compressed string dictionaries. In: DCC 2014, pp. 322–331. IEEE (2014). https://doi.org/10.1109/DCC.2014.36
  3. Benoit, D., Demaine, E.D., Munro, J.I., Raman, R., Raman, V., Rao, S.S.: Representing trees of higher degree. Algorithmica 43(4), 275–292 (2005)
  4. Brisaboa, N.R., Cánovas, R., Claude, F., Martínez-Prieto, M.A., Navarro, G.: Compressed string dictionaries. In: Pardalos, P.M., Rebennack, S. (eds.) SEA 2011. LNCS, vol. 6630, pp. 136–147. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20662-7_12
  5. Daciuk, J., van Noord, G.: Finite automata for compact representation of language models in NLP. In: Watson, B.W., Wood, D. (eds.) CIAA 2001. LNCS, vol. 2494, pp. 65–73. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-36390-4_6
  6. Daciuk, J., van Noord, G.: Finite automata for compact representation of tuple dictionaries. Theor. Comput. Sci. 313(1), 45–56 (2004)
  7. Daciuk, J.: Experiments with automata compression. In: Yu, S., Păun, A. (eds.) CIAA 2000. LNCS, vol. 2088, pp. 105–112. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44674-5_8
  8. Daciuk, J., Piskorski, J.: Gazetteer compression technique based on substructure recognition. In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds.) IIPWM 2006. AINSC, vol. 35, pp. 87–95. Springer, Heidelberg (2006). https://doi.org/10.1007/3-540-33521-8_9
  9. Daciuk, J., Piskorski, J., Ristov, S.: Natural language dictionaries implemented as finite automata. In: Martín-Vide, C. (ed.) Mathematics, Computing, Language, and Life: Frontiers in Mathematical Linguistics and Language Theory, vol. 2, pp. 133–204. World Scientific & Imperial College Press, London (2010)
  10. Daciuk, J., Weiss, D.: Smaller representation of finite state automata. In: Bouchou-Markhoff, B., Caron, P., Champarnaud, J.-M., Maurel, D. (eds.) CIAA 2011. LNCS, vol. 6807, pp. 118–129. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22256-6_12
  11. Ferragina, P., Grossi, R., Gupta, A., Shah, R., Vitter, J.S.: On searching compressed string collections cache-obliviously. In: PODS 2008, pp. 181–190. ACM, New York (2008). https://doi.org/10.1145/1376916.1376943
  12. Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Structuring labeled trees for optimal succinctness, and beyond. In: FOCS 2005, pp. 184–196. IEEE Computer Society (2005). https://doi.org/10.1109/SFCS.2005.69
  13. Ferragina, P., Venturini, R.: The compressed permuterm index. ACM Trans. Algorithms 7(1), 10:1–10:21 (2010). https://doi.org/10.1145/1868237.1868248
  14. Georgiev, K.: Compression of minimal acyclic deterministic FSAs preserving the linear accepting complexity. In: Mihov, S., Schulz, K.U. (eds.) Proceedings Workshop on Finite-State Techniques and Approximate Search 2007, pp. 7–13 (2007)
  15. Grossi, R., Ottaviano, G.: Fast compressed tries through path decompositions. ACM J. Exp. Algorithmics 19(1), 3.4:1.1–3.4:1.20 (2014)
  16. Larsson, N.J., Moffat, A.: Off-line dictionary-based compression. Proc. IEEE 88(11), 1722–1732 (2000). https://doi.org/10.1109/5.892708
  17. Lucchesi, C.L., Kowaltowski, T.: Applications of finite automata representing large vocabularies. Softw. Pract. Exp. 23(1), 15–30 (1993)
  18. Martínez-Prieto, M.A., Brisaboa, N., Cánovas, R., Claude, F., Navarro, G.: Practical compressed string dictionaries. Inf. Syst. 56(C), 73–108 (2016)
  19. Navarro, G.: Indexing text using the Ziv-Lempel trie. J. Discret. Algorithms 2(1), 87–114 (2004). https://doi.org/10.1016/S1570-8667(03)00066-2
  20. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Eppstein, D. (ed.) Proceedings of SODA 2002, pp. 233–242. ACM/SIAM, Philadelphia (2002)
  21. Revuz, D.: Dictionnaires et lexiques: méthodes et algorithmes. Ph.D. thesis, Institut Blaise Pascal, Paris, France (1991)
  22. Ristov, S.: LZ trie and dictionary compression. Softw. Pract. Exp. 35(5), 445–465 (2005). https://doi.org/10.1002/spe.643
  23. Ristov, S., Korenčić, D.: Fast construction of space-optimized recursive automaton. Softw. Pract. Exp. 45(6), 783–799 (2014). https://doi.org/10.1002/spe.2261
  24. Ristov, S., Laporte, E.: Ziv Lempel compression of huge natural language data tries using suffix arrays. In: Crochemore, M., Paterson, M. (eds.) CPM 1999. LNCS, vol. 1645, pp. 196–211. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48452-3_15
  25. Skibiński, P., Grabowski, S., Deorowicz, S.: Revisiting dictionary-based compression. Softw. Pract. Exp. 35(15), 1455–1476 (2005). https://doi.org/10.1002/spe.678
  26. Tounsi, L., Bouchou, B., Maurel, D.: A compression method for natural language automata. In: FSMNLP 2008, pp. 146–157. IOS Press, Amsterdam (2009)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Robert Bakarić (1)
  • Damir Korenčić (1)
  • Strahil Ristov (1)

  1. Department of Electronics, Ruđer Bošković Institute, Zagreb, Croatia