Efficient Computation of Substring Equivalence Classes with Suffix Arrays

Algorithmica

Abstract

This paper considers the enumeration of the substring equivalence classes introduced by Blumer et al. (J ACM 34(3):578–595, 1987). These equivalence classes were originally proposed to define a text indexing structure called the compact directed acyclic word graph (CDAWG). They are also useful for text analysis, since they group together redundant substrings with essentially identical occurrences. We show how to enumerate these equivalence classes using only suffix arrays and two auxiliary arrays (rank arrays and lcp arrays), in \(O(n)\) time for a given string of length \(n\) over the integer alphabet. The proposed method improves on all existing algorithms, which require \(O(n \log \sigma )\) time, where \(\sigma \) is the alphabet size. Our experimental results show that the proposed method is also faster in practice and more memory efficient than the existing ones. Furthermore, we propose an \(O(n)\)-time algorithm that constructs the CDAWG of an input string over the integer alphabet. This algorithm is based on the above-mentioned algorithm for enumerating equivalence classes. Our experiments show that it runs faster than the existing algorithm on large alphabets.
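To make the three arrays named in the abstract concrete, the following minimal sketch builds a suffix array, its rank array (the inverse permutation), and the lcp array for a small string. It is an illustration only and not the enumeration algorithm of the paper: the suffix array is built here by naive sorting, which is not linear time, and the lcp array follows the well-known linear-time scheme of Kasai et al. The function names and the convention `lcp[0] = 0` are assumptions made for this example.

```python
def suffix_array(s):
    # Naive construction (assumption for brevity): sort suffix start
    # positions by the suffix text. This is NOT linear time.
    return sorted(range(len(s)), key=lambda i: s[i:])


def rank_array(sa):
    # rank[i] is the position of suffix i in the suffix array,
    # i.e. the inverse permutation of sa.
    rank = [0] * len(sa)
    for r, i in enumerate(sa):
        rank[i] = r
    return rank


def lcp_array(s, sa, rank):
    # Kasai-style computation: lcp[r] is the length of the longest common
    # prefix of the suffixes sa[r - 1] and sa[r]; lcp[0] is set to 0.
    n = len(s)
    lcp = [0] * n
    h = 0
    for i in range(n):
        r = rank[i]
        if r > 0:
            j = sa[r - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[r] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


text = "banana"
sa = suffix_array(text)
rank = rank_array(sa)
lcp = lcp_array(text, sa, rank)
print(sa)    # [5, 3, 1, 0, 4, 2]
print(rank)  # [3, 2, 5, 1, 4, 0]
print(lcp)   # [0, 1, 3, 0, 0, 2]
```

For a linear-time pipeline, the naive sort would be replaced by one of the linear-time suffix array constructions listed in the references.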

References

  1. Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M.T., Seiferas, J.: The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci. 40, 31–55 (1985)

  2. Blumer, A., Blumer, J., Haussler, D., McConnell, R., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM 34(3), 578–595 (1987)

  3. Crochemore, M., Vérin, R.: Direct construction of compact directed acyclic word graphs. In: Proceedings of CPM 1997, pp. 116–129 (1997)

  4. Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. J. ACM 47(6), 987–1011 (2000)

  5. Inenaga, S., Hoshino, H., Shinohara, A., Takeda, M., Arikawa, S., Mauri, G., Pavesi, G.: On-line construction of compact directed acyclic word graphs. Discret. Appl. Math. 146(2), 156–179 (2005)

  6. Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM 53, 918–936 (2006)

  7. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Proceedings of CPM’01, pp. 181–192 (2001)

  8. Kim, D.K., Sim, J.S., Park, H., Park, K.: Constructing suffix arrays in linear time. J. Discret. Algorithms 3, 126–142 (2005)

  9. Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. J. Discret. Algorithms 3, 143–156 (2005)

  10. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

  11. McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)

  12. Narisawa, K., Bannai, H., Hatano, K., Takeda, M.: Unsupervised spam detection based on string alienness measures. In: Proceedings of DS’07, pp. 159–172 (2007)

  13. Narisawa, K., Inenaga, S., Bannai, H., Takeda, M.: Efficient computation of substring equivalence classes with suffix arrays. In: Proceedings of CPM’07, pp. 340–351 (2007)

  14. Nong, G., Zhang, S., Chan, W.H.: Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput. 60(10), 1471–1484 (2011)

  15. Okanohara, D., Tsujii, J.: Text categorization with all substring features. In: Proceedings of SDM’09, pp. 838–846 (2009)

  16. Puglisi, S.J., Smyth, W.F., Yusufu, M.: Fast optimal algorithms for computing all the repeats in a string. In: Proceedings of PSC’08, pp. 161–169 (2008)

  17. Revuz, D.: Minimisation of acyclic deterministic automata in linear time. Theor. Comput. Sci. 92(1), 181–189 (1992)

  18. sais. https://sites.google.com/site/yuta256/ Accessed 30 Oct 2013

  19. Takeda, M., Matsumoto, T., Fukuda, T., Nanri, I.: Discovering characteristic expressions in literary works. Theor. Comput. Sci. 292(2), 525–546 (2003)

  20. The Canterbury Corpus. http://corpus.canterbury.ac.nz/ Accessed 30 Oct 2013

  21. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)

  22. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of 14th IEEE Annual Symposium on Switching and Automata Theory, pp. 1–11 (1973)

Acknowledgments

This work was partially supported by KAKENHI Grant Number 25240003.

Author information

Corresponding author

Correspondence to Kazuyuki Narisawa.

Cite this article

Narisawa, K., Hiratsuka, H., Inenaga, S. et al. Efficient Computation of Substring Equivalence Classes with Suffix Arrays. Algorithmica 79, 291–318 (2017). https://doi.org/10.1007/s00453-016-0178-z

  • DOI: https://doi.org/10.1007/s00453-016-0178-z
