Abstract
String similarity search is a basic operation in many applications, such as data cleaning, spell checking, bioinformatics and information integration. Memory-based q-gram inverted indexes cannot support string similarity search over large-scale string datasets: they are bounded by the available memory and stop working once the data size grows beyond it. In the era of big data, large string datasets are quite common. The existing external-memory method, Behm-Index, supports only the length filter and the prefix filter. This paper proposes LPA-Index, a disk-resident index that reduces I/O cost to improve query response time and whose data size is not limited by the memory size. LPA-Index supports multiple filters to reduce the number of query candidates effectively, and it adaptively reads inverted lists during query processing for better I/O performance. Experimental results demonstrate the efficiency of LPA-Index and its advantages over the existing state-of-the-art disk index, Behm-Index, with regard to I/O cost and query response time.
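For readers unfamiliar with the q-gram inverted indexes mentioned above, the following is a minimal in-memory sketch of the general filter-and-verify idea. It is not LPA-Index or Behm-Index, and it does not model disk I/O or the prefix filter. It assumes edit distance as the similarity measure and uses the length filter together with the standard q-gram count filter; all function names (`qgrams`, `build_index`, `search`) and the parameter defaults are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """Multiset of overlapping q-grams of s (no padding characters)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def edit_distance(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def build_index(strings, q=2):
    """Inverted index: q-gram -> {string id: occurrences in that string}."""
    index = defaultdict(dict)
    for sid, s in enumerate(strings):
        for gram, count in qgrams(s, q).items():
            index[gram][sid] = count
    return index


def search(strings, index, query, k, q=2):
    """Return (string, distance) pairs with edit distance <= k from query."""
    # Count the q-grams each indexed string shares with the query.
    overlap = Counter()
    for gram, qcount in qgrams(query, q).items():
        for sid, scount in index.get(gram, {}).items():
            overlap[sid] += min(qcount, scount)

    results = []
    for sid, s in enumerate(strings):
        # Length filter: lengths may differ by at most k.
        if abs(len(s) - len(query)) > k:
            continue
        # Count filter: strings within distance k share at least this many
        # q-grams. A threshold <= 0 prunes nothing, so such strings are
        # always passed on to verification.
        threshold = max(len(s), len(query)) - q + 1 - k * q
        if overlap[sid] < threshold:
            continue
        # Verification: compute the exact distance for surviving candidates.
        d = edit_distance(s, query)
        if d <= k:
            results.append((s, d))
    return results


if __name__ == "__main__":
    data = ["similarity", "similarly", "simulator", "dissimilar", "search"]
    idx = build_index(data)
    print(search(data, idx, "similarity", k=2))
```

In this sketch every inverted list lives in a Python dictionary. A disk-resident index would instead have to decide which lists to read from disk, which is the cost the abstract is concerned with.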
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, J., Yang, D. (2015). Efficient String Similarity Search on Disks. In: Wang, H., et al. Intelligent Computation in Big Data Era. ICYCSEE 2015. Communications in Computer and Information Science, vol 503. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46248-5_7
DOI: https://doi.org/10.1007/978-3-662-46248-5_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46247-8
Online ISBN: 978-3-662-46248-5
eBook Packages: Computer Science, Computer Science (R0)