Generalized Mongue-Elkan Method for Approximate Text String Comparison

Jimenez, Sergio; Becerra, Claudia; Gelbukh, Alexander; Gonzalez, Fabio

doi:10.1007/978-3-642-00382-0_45

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5449)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

The Monge-Elkan method is a general text string comparison method that combines an internal character-based similarity measure (e.g. edit distance) with a token-level (i.e. word-level) similarity measure. We propose a generalization of this method that replaces the simple average in the Monge-Elkan expression with the generalized arithmetic mean. Experiments carried out with 12 well-known name-matching data sets show that the proposed approach outperforms the original Monge-Elkan method when character-based measures are used to compare tokens.
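
The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the generalized mean is the power mean with a positive exponent m, that tokens are obtained by whitespace splitting, and that the internal measure is an edit distance normalized by the longer token's length. The function and parameter names (levenshtein, edit_similarity, generalized_monge_elkan, m, sim) are hypothetical and chosen only for this illustration.

```python
def levenshtein(s, t):
    """Edit distance with unit costs, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(s, t):
    """Assumed internal measure: 1 - distance / length of the longer token."""
    if not s and not t:
        return 1.0
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))


def generalized_monge_elkan(a, b, m=2.0, sim=edit_similarity):
    """Power mean (exponent m) of each token's best match in the other string.

    With m = 1 this reduces to the simple average of the best matches.
    Assumption: only m > 0 is supported in this sketch.
    """
    if m <= 0:
        raise ValueError("this sketch only supports m > 0")
    tokens_a, tokens_b = a.split(), b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    # For every token of a, keep its highest similarity to any token of b.
    best = [max(sim(x, y) for y in tokens_b) for x in tokens_a]
    return (sum(v ** m for v in best) / len(best)) ** (1.0 / m)


if __name__ == "__main__":
    print(generalized_monge_elkan("john smith", "jon smith", m=1.0))
    print(generalized_monge_elkan("john smith", "jon smith", m=2.0))
```

In this sketch, exponents above 1 give more weight to the token pairs that match well, so the score never falls below the simple average obtained with m = 1. Note that the measure is asymmetric: swapping the two strings can change the result.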

Author information

Authors and Affiliations

Intelligent Systems Laboratory (LISI) Systems and Industrial Engineering Department, National University of Colombia, Colombia
Sergio Jimenez, Claudia Becerra & Fabio Gonzalez
Natural Language Laboratory Center for Computing Research (CIC), National Polytechnic Institute (IPN), Mexico
Alexander Gelbukh

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Jimenez, S., Becerra, C., Gelbukh, A., Gonzalez, F. (2009). Generalized Mongue-Elkan Method for Approximate Text String Comparison. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_45

DOI: https://doi.org/10.1007/978-3-642-00382-0_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00381-3
Online ISBN: 978-3-642-00382-0
eBook Packages: Computer Science, Computer Science (R0)
