Abstract
LZ77 self-indexes have been proposed for storing repetitive text in compressed form. Existing methods for approximate string matching based on LZ77 focus on space efficiency. We instead focus on how to search efficiently for similar strings without decompressing the whole text. We propose the RS-search algorithm, which efficiently merges all occurrences of a substring to narrow down the potential regions, and we design novel filters to reduce the number of candidates. Experiments show that our algorithm performs well and offers an interesting time-space trade-off for approximate matching over compressed strings.
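To make the problem statement concrete, the following is a minimal sketch of approximate substring matching on plain, uncompressed text. It assumes that similarity is measured by edit distance with a threshold k, and it uses the classical column-wise dynamic programming approach rather than RS-search; the function name `approx_occurrences` and its parameters are illustrative only. It is the kind of baseline that a compressed-domain method aims to avoid running over the fully decompressed text.

```python
def approx_occurrences(pattern, text, k):
    """Return 1-indexed end positions j such that some substring of
    text ending at j is within edit distance k of pattern.

    Assumption: similarity means edit distance (insert, delete, substitute).
    """
    m = len(pattern)
    # col[i] holds the best distance between pattern[:i] and any
    # substring of the text ending at the previous position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # value for the empty pattern prefix, always 0
        col[0] = 0          # a match may start at any text position
        for i in range(1, m + 1):
            cur = min(
                col[i] + 1,                        # extra text character c
                col[i - 1] + 1,                    # pattern character unmatched
                prev_diag + (pattern[i - 1] != c)  # match or substitution
            )
            prev_diag = col[i]
            col[i] = cur
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approx_occurrences("abc", "xxabcxx", 1))
```

The sketch runs in time proportional to the pattern length times the text length, and it needs the text in uncompressed form.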
The work is partially supported by the NSF of China for Outstanding Young Scholars (No. 61322208), the NSF of China (Nos. 61272178, 61572122), and the NSF of China for Key Program (No. 61532021).
Notes
- 1. \(rank_{b}(B, i)\) is the number of occurrences of bit \(b\) in \(B_{1,i}\).
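The rank operation defined in the notes can be sketched as follows. This is an illustrative version only: it assumes the bitmap B is given as a Python list of 0/1 values and answers queries from a plain prefix-sum table, whereas compressed indexes typically rely on succinct rank structures. The class name `RankIndex` is hypothetical.

```python
class RankIndex:
    """Answers rank_b(B, i): occurrences of bit b in B[1..i] (1-indexed)."""

    def __init__(self, bits):
        # prefix_ones[i] = number of 1-bits among the first i bits
        self.prefix_ones = [0]
        for x in bits:
            self.prefix_ones.append(self.prefix_ones[-1] + x)

    def rank(self, b, i):
        ones = self.prefix_ones[i]
        # The zeros among the first i bits are whatever is not a one.
        return ones if b == 1 else i - ones


if __name__ == "__main__":
    B = [1, 0, 1, 1, 0, 0, 1]
    index = RankIndex(B)
    print(index.rank(1, 4), index.rank(0, 4))
```

After linear-time preprocessing each query takes constant time, at the cost of storing one integer per bit.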
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Han, Y., Wang, B., Yang, X. (2016). Efficient Approximate Substring Matching in Compressed String. In: Cui, B., Zhang, N., Xu, J., Lian, X., Liu, D. (eds) Web-Age Information Management. WAIM 2016. Lecture Notes in Computer Science, vol 9659. Springer, Cham. https://doi.org/10.1007/978-3-319-39958-4_15
DOI: https://doi.org/10.1007/978-3-319-39958-4_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39957-7
Online ISBN: 978-3-319-39958-4
eBook Packages: Computer Science, Computer Science (R0)