Coreference Detection of Low Quality Objects

Nielandt, Joachim; Bronselaer, Antoon; De Tré, Guy

doi:10.1007/978-3-642-31709-5_46

Part of the book series: Communications in Computer and Information Science (CCIS, volume 297)

Included in the following conference series:

International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems

Abstract

Record linkage is a widely studied problem that aims to identify coreferent (i.e. duplicate) data in a structured data source. As indicated by Winkler, the record linkage problem can only be solved if the error rate is sufficiently low. In other words, to de-duplicate a database successfully, the objects in the database must be of sufficient quality. However, this assumption does not always hold. This paper investigates how merging low quality objects into one high quality object can improve the process of record linkage. The general idea is illustrated in the context of string comparison, where strings of low quality (i.e. with a high typographical error rate) are merged into a string of high quality by using an n-dimensional Levenshtein distance matrix and computing the optimal alignment between the dirty strings. Results are presented and possible refinements are proposed.
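
The full text is not reproduced here, so the following is only a minimal sketch of the idea the abstract describes, not the authors' implementation. It fills an n-dimensional edit distance table over several dirty strings, traces back one optimal alignment, and merges the aligned columns. The sum-of-pairs column cost, the majority vote used for merging, and the names `GAP`, `column_cost`, `align` and `merge` are all assumptions made for illustration. With two strings the table reduces to the ordinary Levenshtein distance matrix.

```python
from collections import Counter
from itertools import product

GAP = "-"  # assumed gap symbol; must not occur in the input strings


def column_cost(column):
    """Sum-of-pairs cost (an assumption): number of symbol pairs that differ."""
    return sum(
        1
        for i in range(len(column))
        for j in range(i + 1, len(column))
        if column[i] != column[j]
    )


def align(strings):
    """Return (cost, columns) of one optimal alignment of all strings."""
    n = len(strings)
    lengths = tuple(len(s) for s in strings)
    origin = (0,) * n
    # Each move consumes one character from a non-empty subset of the strings.
    moves = [m for m in product((0, 1), repeat=n) if any(m)]
    dist = {origin: 0}
    back = {}
    # Lexicographic order guarantees every predecessor cell is already filled.
    for cell in product(*(range(length + 1) for length in lengths)):
        if cell == origin:
            continue
        best = None
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue
            column = tuple(
                strings[k][prev[k]] if move[k] else GAP for k in range(n)
            )
            candidate = dist[prev] + column_cost(column)
            if best is None or candidate < best[0]:
                best = (candidate, prev, column)
        dist[cell] = best[0]
        back[cell] = (best[1], best[2])
    # Trace back from the far corner to recover the aligned columns.
    columns = []
    cell = lengths
    while cell != origin:
        prev, column = back[cell]
        columns.append(column)
        cell = prev
    columns.reverse()
    return dist[lengths], columns


def merge(strings):
    """Merge dirty strings by majority vote per aligned column (an assumption)."""
    _, columns = align(strings)
    merged = []
    for column in columns:
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != GAP:  # a majority of gaps means the character is dropped
            merged.append(symbol)
    return "".join(merged)


if __name__ == "__main__":
    # Three copies of "record", each with one different typographical error.
    print(merge(["xecord", "rexord", "recoxd"]))  # record
```

The table has one cell per combination of prefix lengths, so time and memory grow exponentially with the number of strings; the sketch is only practical for a handful of short strings.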

References

Bilenko, M., Mooney, R.J.: Learning to combine trained distance metrics for duplicate detection in databases. Tech. rep. (2002)
Bronselaer, A., Hallez, A., De Tré, G.: Extensions of fuzzy measures and the Sugeno integral for possibilistic truth values. International Journal of Intelligent Systems 24(2), 97–117 (2009)
Fellegi, I., Sunter, A.: A theory for record linkage. American Statistical Association Journal 64(328), 1183–1210 (1969)
Gusfield, D.: Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York (1997)
Lehti, P., Fankhauser, P.: Probabilistic Iterative Duplicate Detection. In: Meersman, R. (ed.) OTM 2005. LNCS, vol. 3761, pp. 1225–1242. Springer, Heidelberg (2005)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Tech. Rep. 8 (1966)
Ravikumar, P., Cohen, W.W.: A hierarchical graphical model for record linkage. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI 2004, pp. 454–461. AUAI Press, Arlington (2004)
Tejada, S., Knoblock, C.A., Minton, S.: Learning object identification rules for information integration. Information Systems 26, 607–633 (2001)
Winkler, W.E.: Methods for record linkage and Bayesian networks. Tech. rep., Series RRS2002/05, U.S. Bureau of the Census (2002)

Author information

Authors and Affiliations

Document, Database and Content Management, Ghent University, Belgium
Joachim Nielandt, Antoon Bronselaer & Guy De Tré

Editor information

Editors and Affiliations

Faculty of Economics, University of Catania, Corso Italia, 55, 95129, Catania, Italy
Salvatore Greco & Benedetto Matarazzo
Université Pierre et Marie Curie - Paris6, CNRS UMR 7606, DAPA, LIP6 8, rue du Capitaine Scott, F-75015, Paris, France
Bernadette Bouchon-Meunier
Dip. Matematica e Informatica, Università di Perugia, 06123, Perugia, Italy
Giulianella Coletti
Department of Computer and Management Science, University of Trento, Via Inama 5, 38122, Trento, Italy
Mario Fedrizzi
Machine Intelligence Institute, IONA College, 10801, New Rochelle, NY, USA
Ronald R. Yager

About this paper

Cite this paper

Nielandt, J., Bronselaer, A., De Tré, G. (2012). Coreference Detection of Low Quality Objects. In: Greco, S., Bouchon-Meunier, B., Coletti, G., Fedrizzi, M., Matarazzo, B., Yager, R.R. (eds) Advances on Computational Intelligence. IPMU 2012. Communications in Computer and Information Science, vol 297. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31709-5_46

DOI: https://doi.org/10.1007/978-3-642-31709-5_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31708-8
Online ISBN: 978-3-642-31709-5
eBook Packages: Computer Science (R0)