Coreference Detection of Low Quality Objects

  • Conference paper
Advances on Computational Intelligence (IPMU 2012)

Abstract

Record linkage is a widely studied problem that aims to identify coreferent (i.e. duplicate) data in a structured data source. As indicated by Winkler, a solution to the record linkage problem is only possible if the error rate is sufficiently low. In other words, to successfully de-duplicate a database, the objects in the database must be of sufficient quality. However, this assumption does not always hold. This paper investigates how merging low quality objects into one high quality object can improve the process of record linkage. The general idea is illustrated in the context of string comparison, where strings of low quality (i.e. with a high typographical error rate) are merged into a string of high quality by using an n-dimensional Levenshtein distance matrix and computing the optimal alignment between the dirty strings. Results are presented and possible refinements are proposed.
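The abstract does not give the cost function or the merge rule, so the following is only a minimal sketch of the general idea under stated assumptions: unit sum-of-pairs costs for each aligned column, a majority vote per column as the merge rule, and `-` as a gap symbol that does not occur in the input. The names `align`, `merge`, `column_cost` and `GAP` are hypothetical and not taken from the paper. For two strings the table reduces to the classic Levenshtein distance matrix.

```python
from collections import Counter
from itertools import combinations, product

GAP = "-"  # assumed gap symbol; must not occur in the input strings


def column_cost(column):
    """Assumed sum-of-pairs cost: one unit per pair of differing symbols."""
    return sum(a != b for a, b in combinations(column, 2))


def align(strings):
    """Optimal alignment of n strings via an n-dimensional distance table."""
    n = len(strings)
    # Every move advances a non-empty subset of the strings by one character.
    moves = [m for m in product((0, 1), repeat=n) if any(m)]
    origin = (0,) * n
    corner = tuple(len(s) for s in strings)
    cost = {origin: 0}
    back = {}
    # product() yields cells in lexicographic order, so every predecessor
    # of a cell has already been filled in when the cell is visited.
    for cell in product(*(range(len(s) + 1) for s in strings)):
        if cell == origin:
            continue
        best = None
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue
            # Strings that advance contribute a character, the others a gap.
            column = tuple(
                s[p] if m else GAP for s, p, m in zip(strings, prev, move)
            )
            total = cost[prev] + column_cost(column)
            if best is None or total < best[0]:
                best = (total, prev, column)
        cost[cell], back[cell] = best[0], best[1:]
    # Trace back from the far corner to recover the aligned columns.
    columns, cell = [], corner
    while cell != origin:
        cell, column = back[cell]
        columns.append(column)
    return cost[corner], columns[::-1]


def merge(strings):
    """Majority vote per aligned column; columns won by the gap are dropped."""
    _, columns = align(strings)
    winners = (Counter(col).most_common(1)[0][0] for col in columns)
    return "".join(ch for ch in winners if ch != GAP)


print(merge(["record", "recrd", "rekord"]))  # record
```

The table has one cell per combination of prefix lengths and each cell considers 2^n - 1 predecessor moves, so this sketch is only practical for a handful of short strings.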

References

  1. Bilenko, M., Mooney, R.J.: Learning to combine trained distance metrics for duplicate detection in databases. Tech. rep. (2002)

  2. Bronselaer, A., Hallez, A., De Tré, G.: Extensions of fuzzy measures and the Sugeno integral for possibilistic truth values. International Journal of Intelligent Systems 24(2), 97–117 (2009)

  3. Fellegi, I., Sunter, A.: A theory for record linkage. Journal of the American Statistical Association 64(328), 1183–1210 (1969)

  4. Gusfield, D.: Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York (1997)

  5. Lehti, P., Fankhauser, P.: Probabilistic Iterative Duplicate Detection. In: Meersman, R. (ed.) OTM 2005. LNCS, vol. 3761, pp. 1225–1242. Springer, Heidelberg (2005)

  6. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Tech. Rep. 8 (1966)

  7. Ravikumar, P., Cohen, W.W.: A hierarchical graphical model for record linkage. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI 2004, pp. 454–461. AUAI Press, Arlington (2004)

  8. Tejada, S., Knoblock, C.A., Minton, S.: Learning object identification rules for information integration. Information Systems 26, 607–633 (2001)

  9. Winkler, W.E.: Methods for record linkage and Bayesian networks. Tech. rep., Series RRS2002/05, U.S. Bureau of the Census (2002)

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nielandt, J., Bronselaer, A., De Tré, G. (2012). Coreference Detection of Low Quality Objects. In: Greco, S., Bouchon-Meunier, B., Coletti, G., Fedrizzi, M., Matarazzo, B., Yager, R.R. (eds) Advances on Computational Intelligence. IPMU 2012. Communications in Computer and Information Science, vol 297. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31709-5_46

  • DOI: https://doi.org/10.1007/978-3-642-31709-5_46

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31708-8

  • Online ISBN: 978-3-642-31709-5

  • eBook Packages: Computer Science, Computer Science (R0)
