Skip to main content

String Distance Metrics for Reference Matching and Search Query Correction

  • Conference paper
Business Information Systems (BIS 2007)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4439))

Included in the following conference series:

Abstract

String distance metrics have been widely used in various applications concerning processing of textual data. This paper reports on the exploration of their usability for tackling the reference matching task and for the automatic correction of misspelled search engine queries, in the context of highly inflective languages, in particular focusing on Polish. The results of numerous experiments in different scenarios are presented and they revealed some preferred metrics. Surprisingly good results were observed for correcting misspelled search engine queries. Nevertheless, a more in-depth analysis is necessary to achieve improvements. The work reported here constitutes a good point of departure for further research on this topic.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Morton, T.: Coreference for NLP Applications. In: Proceedings of ACL 1997 (1997)

    Google Scholar 

  2. Cohen, W., Ravikumar, P., Fienberg, S.: A comparison of string metrics for matching names and records. In: Proceedings of the KDD2003 (2003)

    Google Scholar 

  3. Cohen, W., Ravikumar, P., Fienberg, S.: A comparison of string distance metrics for name-matching tasks. In: Proceedings of the IJCAI-2003 (2003)

    Google Scholar 

  4. Piskorski, J.: Named Entity Recognition for Polish with SProUT. In: Bolc, L., Michalewicz, Z., Nishida, T. (eds.) IMTCI 2004. LNCS (LNAI), vol. 3490, pp. 122–132. Springer, Heidelberg (2005)

    Chapter  Google Scholar 

  5. Linden, K.: Multilingual modeling of cross-lingual spelling variants. Information Retrieval 9(3), 295–310 (2006)

    Article  Google Scholar 

  6. Kukich, K.: Techniques for automatically correcting words in text. In: CSC ’93: Proceedings of the 1993 ACM conference on Computer science, Indianapolis, Indiana, United States, ACM Press, New York (1993)

    Google Scholar 

  7. Levenshtein, V.: Binary Codes for Correcting Deletions, Insertions, and Reversals. Doklady Akademii Nauk SSSR 163(4), 845–848 (1965)

    MathSciNet  Google Scholar 

  8. Landau, G., Vishkin, U.: Fast Parallel and Serial Approximate String Matching. Journal of Algorithms 10, 157–169 (1997)

    Article  MathSciNet  Google Scholar 

  9. Needleman, S., Wunsch, C.: A General Method Applicable to Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48(3), 443–453 (1970)

    Article  Google Scholar 

  10. Smith, T., Waterman, M.: Identification of Common Molecular Subsequences. Journal of Molecular Biology 147, 195–197 (1981)

    Article  Google Scholar 

  11. Winkle, W.: The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Wachington, DC (1999)

    Google Scholar 

  12. Ukkonen, E.: Approximate String Matching with q-grams and Maximal Matches. Theoretical Computer Science 92(1), 191–211 (1992)

    Article  MATH  MathSciNet  Google Scholar 

  13. Russell, R.: U.S. Patent 1,261,167 (1918), http://pattft.uspto.gov/netahtml/srchnum.htm

  14. Monge, A., Elkan, C.: The Field Matching Problem: Algorithms and Applications. In: Proceedings of Knowledge Discovery and Data Mining 1996, pp. 267–270 (1996)

    Google Scholar 

  15. NetSprint, http://www.netsprint.pl

  16. Cucerzan, S., Brill, E.: Spelling correction as an iterative process that exploits the collective knowledge of web users. In: Proceedings of EMNLP 2004, pp. 293–300 (2004)

    Google Scholar 

  17. Gravano, L., et al.: Using q-grams in a DBMS for Approximate String Processing. IEEE Data Engineering Bulletin 24(4), 28–34 (2001)

    Google Scholar 

  18. Bilenko, M., Mooney, R.: Employing trainable string similarity metrics for information integration. In: Proceedings of the IJCAI Workshop on Information Integration on the Web, Acapulco, Mexico (August 2003)

    Google Scholar 

  19. SimMetric, http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Download references

Author information

Authors and Affiliations

Authors

Editor information

Witold Abramowicz

Rights and permissions

Reprints and permissions

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Piskorski, J., Sydow, M. (2007). String Distance Metrics for Reference Matching and Search Query Correction. In: Abramowicz, W. (eds) Business Information Systems. BIS 2007. Lecture Notes in Computer Science, vol 4439. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72035-5_27

Download citation

  • DOI: https://doi.org/10.1007/978-3-540-72035-5_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72034-8

  • Online ISBN: 978-3-540-72035-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics