Efficient String Similarity Search on Disks

Wang, Jinbao; Yang, Donghua

doi:10.1007/978-3-662-46248-5_7

Part of the book series: Communications in Computer and Information Science (CCIS, volume 503)

Included in the following conference series:

International Conference of Young Computer Scientists, Engineers and Educators

Abstract

String similarity search is a basic operation in many applications, such as data cleaning, spell checking, bioinformatics, and information integration. Memory-based q-gram inverted indexes cannot support string similarity search over large-scale string datasets, because they stop working once the data size grows beyond the memory size. In the era of big data, large string datasets are quite common. The existing external-memory method, Behm-Index, supports only the length filter and the prefix filter. This paper proposes LPA-Index, a disk-resident index that places no limit on data size relative to memory size and that reduces I/O cost to improve query response time. LPA-Index supports multiple filters to reduce query candidates effectively, and it adaptively reads inverted lists during query processing for better I/O performance. Experimental results demonstrate the efficiency of LPA-Index and its advantages over the existing state-of-the-art disk index, Behm-Index, with regard to I/O cost and query response time.
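
To make the terms in the abstract concrete, the following is a minimal in-memory sketch of a q-gram inverted index with a count filter and a length filter, followed by edit-distance verification. It is not LPA-Index or Behm-Index: the function names (`qgrams`, `build_index`, `search`), the choice of q = 2, the absence of padding, and the use of Python dictionaries in place of disk-resident inverted lists are all assumptions made for illustration. The count filter relies on the standard bound that two strings within edit distance k share at least |query| - q + 1 - k·q q-grams, counted as multisets.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """Return the q-grams of s as a list (no padding; an illustrative choice)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    """Dynamic-programming edit distance, kept to two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def build_index(strings, q=2):
    """Map each q-gram to {string id: number of occurrences in that string}."""
    index = defaultdict(dict)
    for sid, s in enumerate(strings):
        for gram, count in Counter(qgrams(s, q)).items():
            index[gram][sid] = count
    return index


def search(query, k, strings, index, q=2):
    """Return all strings within edit distance k of query, sorted."""
    # Count filter: a string within distance k must share at least
    # this many q-grams (as multisets) with the query.
    threshold = len(query) - q + 1 - k * q
    if threshold <= 0:
        # The bound is vacuous, so every string remains a candidate.
        candidates = range(len(strings))
    else:
        overlap = Counter()
        for gram, cq in Counter(qgrams(query, q)).items():
            for sid, cs in index.get(gram, {}).items():
                overlap[sid] += min(cq, cs)
        candidates = [sid for sid, n in overlap.items() if n >= threshold]

    results = []
    for sid in candidates:
        s = strings[sid]
        # Length filter: lengths may differ by at most k.
        if abs(len(s) - len(query)) > k:
            continue
        # Verification: compute the actual edit distance.
        if edit_distance(query, s) <= k:
            results.append(s)
    return sorted(results)
```

For example, with the hypothetical dataset `["similarity", "similarly", "simulate", "singular", "dissimilar"]` and `index = build_index(strings)`, calling `search("similarity", 2, strings, index)` returns `["similarity", "similarly"]`. In a disk-based setting each inverted list lookup would cost I/O, which is why the abstract emphasizes filtering and choosing which lists to read.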

References

Wagner, R., Fischer, M.: The string-to-string correction problem. Journal of the ACM 21(1), 168–173 (1974)
Behm, A., Chen, L., et al.: Answering approximate string queries on large data sets using external memory. In: Proc of IEEE ICDE 2011, pp. 888–899. IEEE Computer Society, Washington, DC (2011)
Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Computing Surveys 38(2), 6–20 (2006)
Chen, L., Jiaheng, L., et al.: Efficient merging and filtering algorithms for approximate string searches. In: Proc of IEEE ICDE 2008, pp. 257–266. IEEE Computer Society, Washington, DC (2008)
Hadjieleftheriou, M., Koudas, N., et al.: Incremental maintenance of length normalized indexes for approximate string matching. In: Proc of ACM SIGMOD 2010, pp. 429–440. ACM, New York (2011)
Jianbin, Q., Wei, W., et al.: Efficient exact edit similarity query processing with the asymmetric signature scheme. In: Proc of ACM SIGMOD 2011, pp. 1033–1044. ACM, New York (2011)
Zhenjie, Z., Marios, H., Beng-Chin, O., et al.: Bed-tree: An all-purpose index structure for string similarity search based on edit distance. In: Proc of SIGMOD 2010, pp. 915–926. ACM, New York (2010)

Author information

Authors and Affiliations

The Academy of Fundamental and Interdisciplinary Sciences, Harbin Institute of Technology, Harbin, 150080, China
Jinbao Wang & Donghua Yang

Editor information

Editors and Affiliations

Harbin Institute of Technology, Harbin, China
Hongzhi Wang & Wanxiang Che
School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China
Haoliang Qi & Zhongyuan Han
Northeast Forestry University, Harbin, China
Zhaowen Qiu
Heilongjiang Institute of Technology, Harbin, China
Leilei Kong
Harbin Engineering University, China
Junyu Lin
Zhongkeyunhai Company, Harbin, China
Zeguang Lu

Cite this paper

Wang, J., Yang, D. (2015). Efficient String Similarity Search on Disks. In: Wang, H., et al. Intelligent Computation in Big Data Era. ICYCSEE 2015. Communications in Computer and Information Science, vol 503. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46248-5_7

DOI: https://doi.org/10.1007/978-3-662-46248-5_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46247-8
Online ISBN: 978-3-662-46248-5
eBook Packages: Computer Science, Computer Science (R0)
