String Distance Metrics for Reference Matching and Search Query Correction

Piskorski, Jakub; Sydow, Marcin

doi:10.1007/978-3-540-72035-5_27

Jakub Piskorski¹ &
Marcin Sydow²

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4439))

Included in the following conference series:

International Conference on Business Information Systems

1920 Accesses
8 Citations

Abstract

String distance metrics have been widely used in various applications concerning processing of textual data. This paper reports on the exploration of their usability for tackling the reference matching task and for the automatic correction of misspelled search engine queries, in the context of highly inflective languages, in particular focusing on Polish. The results of numerous experiments in different scenarios are presented and they revealed some preferred metrics. Surprisingly good results were observed for correcting misspelled search engine queries. Nevertheless, a more in-depth analysis is necessary to achieve improvements. The work reported here constitutes a good point of departure for further research on this topic.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Morton, T.: Coreference for NLP Applications. In: Proceedings of ACL 1997 (1997)
Google Scholar
Cohen, W., Ravikumar, P., Fienberg, S.: A comparison of string metrics for matching names and records. In: Proceedings of the KDD2003 (2003)
Google Scholar
Cohen, W., Ravikumar, P., Fienberg, S.: A comparison of string distance metrics for name-matching tasks. In: Proceedings of the IJCAI-2003 (2003)
Google Scholar
Piskorski, J.: Named Entity Recognition for Polish with SProUT. In: Bolc, L., Michalewicz, Z., Nishida, T. (eds.) IMTCI 2004. LNCS (LNAI), vol. 3490, pp. 122–132. Springer, Heidelberg (2005)
Chapter Google Scholar
Linden, K.: Multilingual modeling of cross-lingual spelling variants. Information Retrieval 9(3), 295–310 (2006)
Article Google Scholar
Kukich, K.: Techniques for automatically correcting words in text. In: CSC ’93: Proceedings of the 1993 ACM conference on Computer science, Indianapolis, Indiana, United States, ACM Press, New York (1993)
Google Scholar
Levenshtein, V.: Binary Codes for Correcting Deletions, Insertions, and Reversals. Doklady Akademii Nauk SSSR 163(4), 845–848 (1965)
MathSciNet Google Scholar
Landau, G., Vishkin, U.: Fast Parallel and Serial Approximate String Matching. Journal of Algorithms 10, 157–169 (1997)
Article MathSciNet Google Scholar
Needleman, S., Wunsch, C.: A General Method Applicable to Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48(3), 443–453 (1970)
Article Google Scholar
Smith, T., Waterman, M.: Identification of Common Molecular Subsequences. Journal of Molecular Biology 147, 195–197 (1981)
Article Google Scholar
Winkle, W.: The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Wachington, DC (1999)
Google Scholar
Ukkonen, E.: Approximate String Matching with q-grams and Maximal Matches. Theoretical Computer Science 92(1), 191–211 (1992)
Article MATH MathSciNet Google Scholar
Russell, R.: U.S. Patent 1,261,167 (1918), http://pattft.uspto.gov/netahtml/srchnum.htm
Monge, A., Elkan, C.: The Field Matching Problem: Algorithms and Applications. In: Proceedings of Knowledge Discovery and Data Mining 1996, pp. 267–270 (1996)
Google Scholar
NetSprint, http://www.netsprint.pl
Cucerzan, S., Brill, E.: Spelling correction as an iterative process that exploits the collective knowledge of web users. In: Proceedings of EMNLP 2004, pp. 293–300 (2004)
Google Scholar
Gravano, L., et al.: Using q-grams in a DBMS for Approximate String Processing. IEEE Data Engineering Bulletin 24(4), 28–34 (2001)
Google Scholar
Bilenko, M., Mooney, R.: Employing trainable string similarity metrics for information integration. In: Proceedings of the IJCAI Workshop on Information Integration on the Web, Acapulco, Mexico (August 2003)
Google Scholar
SimMetric, http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Download references

Author information

Authors and Affiliations

Joint Research Center of the European Commission, Web and Language Technology Group of IPSC, T.P. 267, Via Fermi 1, 21020 Ispra (VA), Italy
Jakub Piskorski
Polish-Japanese Institute of Information Technology (PJIIT), Department of Intelligent Systems, Koszykowa 86, 02-008 Warsaw, Poland
Marcin Sydow

Authors

Jakub Piskorski
View author publications
You can also search for this author in PubMed Google Scholar
Marcin Sydow
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Witold Abramowicz

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Piskorski, J., Sydow, M. (2007). String Distance Metrics for Reference Matching and Search Query Correction. In: Abramowicz, W. (eds) Business Information Systems. BIS 2007. Lecture Notes in Computer Science, vol 4439. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72035-5_27

Download citation

DOI: https://doi.org/10.1007/978-3-540-72035-5_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72034-8
Online ISBN: 978-3-540-72035-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics