On Bit-Parallel Processing of Multi-byte Text

Hyyrö, Heikki; Takaba, Jun; Shinohara, Ayumi; Takeda, Masayuki

doi:10.1007/978-3-540-31871-2_25

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3411)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

There exist practical bit-parallel algorithms for several types of pair-wise string processing, such as longest common subsequence computation or approximate string matching. These algorithms typically use a size-σ table of match bit-vectors, where σ is the alphabet size and the bits in the vector for a character λ identify the positions where λ occurs in one of the processed strings. The time or space cost of computing the match table is not prohibitive for reasonably small alphabets such as ASCII. For general Unicode text, however, the numerical code range of the characters is roughly one million, which makes a simple table impractical. In this paper we evaluate three schemes for overcoming this problem. First we propose replacing the character code table with a character code automaton. We then compare this method with two other schemes: a hash table, and the binary-search based solution proposed by Wu, Manber and Myers [25]. We find that the best choice is either the automaton-based method or a hash table.
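To make the role of the match table concrete, the following is a minimal Python sketch of two of the lookup schemes named in the abstract: a hash table and a binary search over sorted character codes. It is an illustration under stated assumptions, not the paper's implementation. The function names (`build_match_dict`, `build_match_sorted`, `lookup_sorted`, `lcs_length`) and the sample strings are hypothetical, and the LCS update used here is a standard bit-vector recurrence of the kind found in the cited LCS literature, chosen only to show where the match vectors are consumed.

```python
from bisect import bisect_left


def build_match_dict(pattern):
    """Hash-table scheme: one match bit-vector per distinct character."""
    table = {}
    for i, ch in enumerate(pattern):
        # Bit i is set when pattern[i] == ch.
        table[ch] = table.get(ch, 0) | (1 << i)
    return table


def build_match_sorted(pattern):
    """Binary-search scheme: parallel arrays sorted by code point."""
    table = build_match_dict(pattern)
    codes = sorted(ord(ch) for ch in table)
    vectors = [table[chr(code)] for code in codes]
    return codes, vectors


def lookup_sorted(codes, vectors, ch):
    """Return the match vector of ch, or 0 if ch is not in the pattern."""
    code = ord(ch)
    k = bisect_left(codes, code)
    if k < len(codes) and codes[k] == code:
        return vectors[k]
    return 0


def lcs_length(pattern, text, lookup):
    """Bit-parallel LCS length; lookup(ch) returns the match vector of ch."""
    m = len(pattern)
    mask = (1 << m) - 1
    v = mask  # all ones: no LCS increments recorded yet
    for ch in text:
        u = v & lookup(ch)
        v = ((v + u) | (v - u)) & mask
    # Each zero bit among the low m bits marks one unit of LCS length.
    return m - bin(v).count("1")


# Hypothetical multi-byte sample strings.
pattern, text = "日本語テキスト", "日本のテキスト"
table = build_match_dict(pattern)
codes, vectors = build_match_sorted(pattern)

by_hash = lcs_length(pattern, text, lambda ch: table.get(ch, 0))
by_search = lcs_length(pattern, text, lambda ch: lookup_sorted(codes, vectors, ch))
print(by_hash, by_search)
```

Both structures store one vector per distinct pattern character rather than one per possible code point, so their size depends on the pattern length and not on the roughly one-million-entry Unicode code range. They differ only in lookup cost per text character: expected constant time for the hash table versus logarithmic in the number of distinct pattern characters for the binary search.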

References

[1] Aho, A., Corasick, M.: Efficient string matching: an aid to bibliographic search. Communications of the ACM 18(6), 333–340 (1975)
[2] Allison, A., Dix, T.L.: A bit-string longest common subsequence algorithm. Information Processing Letters 23, 305–310 (1986)
[3] Baeza-Yates, R., Gonnet, G.: A new approach to text searching. Communications of the ACM 35(10), 74–82 (1992)
[4] Boyer, R.S., Moore, J.S.: A fast string searching algorithm. Communications of the ACM 20(10), 762–772 (1977)
[5] Crochemore, M., Iliopoulos, C.S., Pinzon, Y.J., Reid, J.F.: A fast and practical bit-vector algorithm for the longest common subsequence problem. Information Processing Letters 80, 279–285 (2001)
[6] Crochemore, M., Rytter, W.: Text Algorithms. Oxford University Press, Oxford (1994)
[7] Czumaj, A., Crochemore, M., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W., Rytter, W.: Speeding up two string-matching algorithms. Algorithmica 12, 247–267 (1994)
[8] Damerau, F.: A technique for computer detection and correction of spelling errors. Communications of the ACM 7(3), 171–176 (1964)
[9] Hyyrö, H.: Explaining and extending the bit-parallel approximate string matching algorithm of Myers. Technical Report A-2001-10, Dept. of Computer and Information Sciences, University of Tampere, Tampere, Finland (2001)
[10] Hyyrö, H.: Bit-parallel approximate string matching with transposition. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds.) SPIRE 2003. LNCS, vol. 2857, pp. 95–107. Springer, Heidelberg (2003)
[11] Hyyrö, H.: Bit-parallel LCS-length computation revisited. In: Proc. 15th Australasian Workshop on Combinatorial Algorithms, AWOCA 2004 (2004)
[12] Hyyrö, H., Navarro, G.: Faster bit-parallel approximate string matching. In: Apostolico, A., Takeda, M. (eds.) CPM 2002. LNCS, vol. 2373, p. 203. Springer, Heidelberg (2002)
[13] Knuth, D.E., Morris Jr., J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM Journal on Computing 6(1), 323–350 (1977)
[14] Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)
[15] Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM 46(3), 395–415 (1999)
[16] Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31–88 (2001)
[17] Navarro, G.: NR-grep: a fast and flexible pattern matching tool. Software Practice and Experience 31, 1265–1312 (2001)
[18] Navarro, G., Raffinot, M.: Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithms 5(4) (2000)
[19] Navarro, G., Raffinot, M.: Flexible Pattern Matching in Strings – Practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge (2002)
[20] Robertson, A.M., Willett, P.: A comparison of spelling-correction methods for the identification of word forms in historical text databases. Literary and Linguistic Computing 8(3), 143–152 (1993)
[21] Sankoff, D., Kruskal, J. (eds.): Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading (1983)
[22] Takeda, M., Miyamoto, S., Kida, T., Shinohara, A., Fukumachi, S., Shinohara, T., Arikawa, S.: Processing text files as is: Pattern matching over compressed texts, multi-byte character texts, and semi-structured texts. In: Laender, A.H.F., Oliveira, A.L. (eds.) SPIRE 2002. LNCS, vol. 2476, p. 170. Springer, Heidelberg (2002)
[23] Unicode Consortium: Unicode Home Page, http://www.unicode.org/
[24] Unicode Consortium: The Unicode Standard 4.0. Addison-Wesley (2003)
[25] Wu, S., Manber, U., Myers, E.: A sub-quadratic algorithm for approximate limited expression matching. Algorithmica 15(1), 50–67 (1996)

Author information

Authors and Affiliations

PRESTO, Japan Science and Technology Agency (JST): Heikki Hyyrö & Ayumi Shinohara
Department of Informatics, Kyushu University 33, Fukuoka, 812-8581, Japan: Jun Takaba, Ayumi Shinohara & Masayuki Takeda
SORST, Japan Science and Technology Agency (JST): Masayuki Takeda

Editor information

Editors and Affiliations

School of Engineering, Information and Communications University, 119, Munjiro, Yuseong-gu, 305-732, Daejeon, Korea: Sung Hyon Myaeng
The Key Laboratory of Power System Protection and Dynamic Security Monitoring and Control under Ministry of Education, North China Electric Power University, Zhuxinzhuang Dewai, 102206, Beijing, China: Ming Zhou
Department of Systems Engineering and Engineering Management, Shatin, The Chinese University of Hong Kong, Hong Kong, N.T.: Kam-Fai Wong
5F, Beijing Sigma Center, Microsoft Research Asia, No. 49 Zhichun Road Haidian District, 100080, Beijing, China: Hong-Jiang Zhang

About this paper

Cite this paper

Hyyrö, H., Takaba, J., Shinohara, A., Takeda, M. (2005). On Bit-Parallel Processing of Multi-byte Text. In: Myaeng, S.H., Zhou, M., Wong, KF., Zhang, HJ. (eds) Information Retrieval Technology. AIRS 2004. Lecture Notes in Computer Science, vol 3411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31871-2_25

DOI: https://doi.org/10.1007/978-3-540-31871-2_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25065-4
Online ISBN: 978-3-540-31871-2
eBook Packages: Computer Science, Computer Science (R0)
