Skip to main content

On Bit-Parallel Processing of Multi-byte Text

  • Conference paper
Information Retrieval Technology (AIRS 2004)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 3411))

Included in the following conference series:

Abstract

There exist practical bit-parallel algorithms for several types of pair-wise string processing, such as longest common subsequence computation or approximate string matching. The bit-parallel algorithms typically use a size-σ table of match bit-vectors, where the bits in the vector for a character λ identify the positions where the character λ occurs in one of the processed strings, and σ is the alphabet size. The time or space cost of computing the match table is not prohibitive with reasonably small alphabets such as ASCII text. However, for example in the case of general Unicode text the possible numerical code range of the characters is roughly one million. This makes using a simple table impractical. In this paper we evaluate three different schemes for overcoming this problem. First we propose to replace the character code table by a character code automaton. Then we compare this method with two other schemes: using a hash table, and the binary-search based solution proposed by Wu, Manber and Myers [25]. We find that the best choice is to use either the automaton-based method or a hash table.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Aho, A., Corasick, M.: Efficient string matching: an aid to bibliographic search. Communications of the ACM 18(6), 333–340 (1975)

    Article  MATH  MathSciNet  Google Scholar 

  2. Allison, A., Dix, T.L.: A bit-string longest common subsequence algorithm. Information Processing Letters 23, 305–310 (1986)

    Article  MathSciNet  Google Scholar 

  3. Baeza-Yates, R., Gonnet, G.: A new approach to text searching. Communications of the ACM 35(10), 74–82 (1992)

    Article  Google Scholar 

  4. Boyer, R.S., Moore, J.S.: A fast string searching algorithm. Communications of the ACM 20(10), 762–772 (1977)

    Article  Google Scholar 

  5. Crochemore, M., Iliopoulos, C.S., Pinzon, Y.J., Reid, J.F.: A fast and practical bit-vector algorithm for the longest common subsequence problem. Information Processing Letters 80, 279–285 (2001)

    Article  MATH  MathSciNet  Google Scholar 

  6. Crochemore, M., Rytter, W.: Text Algorithms. Oxford University Press, Oxford (1994)

    MATH  Google Scholar 

  7. Czumaj, A., Crochemore, M., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W., Rytter, W.: Speeding up two string-matching algorithms. Algorithmica 12, 247–267 (1994)

    Article  MATH  MathSciNet  Google Scholar 

  8. Damerau, F.: A technique for computer detection and correction of spelling errors. Communications of the ACM 7(3), 171–176 (1964)

    Article  Google Scholar 

  9. Hyyrö, H.: Explaining and extending the bit-parallel approximate string matching algorithm of Myers. Technical Report A-2001-10, Dept. of Computer and Information Sciences, University of Tampere, Tampere, Finland (2001)

    Google Scholar 

  10. Hyyrö, H.: Bit-parallel approximate string matching with transposition. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds.) SPIRE 2003. LNCS, vol. 2857, pp. 95–107. Springer, Heidelberg (2003)

    Chapter  Google Scholar 

  11. Hyyrö, H.: Bit-parallel LCS-length computation revisited. In: Proc. 15th Australasian Workshop on Combinatorial Algorithms, AWOCA 2004 (2004)

    Google Scholar 

  12. Hyyrö, H., Navarro, G.: Faster bit-parallel approximate string matching. In: Apostolico, A., Takeda, M. (eds.) CPM 2002. LNCS, vol. 2373, p. 203. Springer, Heidelberg (2002)

    Chapter  Google Scholar 

  13. Knuth, D.E., Morris Jr., J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM Journal on Computing 6(1), 323–350 (1977)

    Article  MATH  MathSciNet  Google Scholar 

  14. Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)

    MathSciNet  Google Scholar 

  15. Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic progamming. Journal of the ACM 46(3), 395–415 (1999)

    Article  MATH  MathSciNet  Google Scholar 

  16. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31–88 (2001)

    Article  Google Scholar 

  17. Navarro, G.: NR-grep: a fast and flexible pattern matching tool. Software Practice and Experience 31, 1265–1312 (2001)

    Article  MATH  Google Scholar 

  18. Navarro, G., Raffinot, M.: Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithms 5(4) (2000)

    Google Scholar 

  19. Navarro, G., Raffinot, M.: Flexible Pattern Matching in Strings – Practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge (2002)

    MATH  Google Scholar 

  20. Robertson, A.M., Willett, P.: A comparison of spelling-correction methods for the identification of word forms in historical text databases. Literary and Linguistic Computing 8(3), 143–152 (1993)

    Article  Google Scholar 

  21. Sankoff, D., Kruskal, J. (eds.): Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading (1983)

    Google Scholar 

  22. Takeda, M., Miyamoto, S., Kida, T., Shinohara, A., Fukumachi, S., Shinohara, T., Arikawa, S.: Processing text files as is: Pattern matching over compressed tests, multi-byte character texts, and semi-structured tests. In: Laender, A.H.F., Oliveira, A.L. (eds.) SPIRE 2002. LNCS, vol. 2476, p. 170. Springer, Heidelberg (2002)

    Chapter  Google Scholar 

  23. Unicode Consortium.: Unicode Home Page, http://www.unicode.org/

  24. Unicode Consortium.: The Unicode Standard 4.0. Addison-Wesley (2003)

    Google Scholar 

  25. Wu, S., Manber, U., Myers, E.: A sub-quadratic algorithm for approximate limited expression matching. Algorithmica 15(1), 50–67 (1996)

    Article  MATH  MathSciNet  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hyyrö, H., Takaba, J., Shinohara, A., Takeda, M. (2005). On Bit-Parallel Processing of Multi-byte Text. In: Myaeng, S.H., Zhou, M., Wong, KF., Zhang, HJ. (eds) Information Retrieval Technology. AIRS 2004. Lecture Notes in Computer Science, vol 3411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31871-2_25

Download citation

  • DOI: https://doi.org/10.1007/978-3-540-31871-2_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25065-4

  • Online ISBN: 978-3-540-31871-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics