The Normalized Compression Distance as a Distance Measure in Entity Identification

Klenk, Sebastian; Thom, Dennis; Heidemann, Gunther

doi:10.1007/978-3-642-03067-3_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5633)

Included in the following conference series:

Industrial Conference on Data Mining

Abstract

The identification of identical entities across heterogeneous data sources still involves a large amount of manual processing. This is mainly because different sources use different data representations in varying semantic contexts. Until now, entity identification has required either the often manual unification of different representations or the programming of tools with specialized interfaces for each representation type. For large and sparse databases, which are common in medical data, for example, the manual approach becomes infeasible.

We have developed a widely applicable compression-based approach that does not rely on structural or semantic unity. The results we have obtained are promising in both recognition precision and performance.
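
The abstract does not spell out the procedure, so the following is only a minimal sketch of how a compression-based distance can compare records without a shared schema. It uses the standard definition of the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of zlib as the compressor, the helper names `compressed_size`, `ncd` and `closest_record`, and the sample records are all assumptions made for illustration, not details taken from the paper.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def closest_record(query: str, candidates: list) -> str:
    """Return the candidate record with the smallest NCD to the query."""
    q = query.encode("utf-8")
    return min(candidates, key=lambda c: ncd(q, c.encode("utf-8")))


if __name__ == "__main__":
    # Hypothetical records describing people in two different layouts.
    query = "John Smith, born 1970-01-01, Stuttgart"
    candidates = [
        "Smith, John; 1970-01-01; Stuttgart",
        "Mueller, Anna; 1985-06-15; Leipzig",
    ]
    for c in candidates:
        print(f"{ncd(query.encode(), c.encode()):.3f}  {c}")
    print("closest:", closest_record(query, candidates))
```

Because the records are compared as raw byte strings, no field mapping between the two layouts is needed. On very short strings the fixed overhead of a general-purpose compressor distorts the values, so the distances are best read as a relative ranking rather than as absolute similarities.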

Author information

Authors and Affiliations

Intelligent Systems Group, Stuttgart University, Email: ais@vis.uni-stuttgart.de, Universitätsstrasse 38, 70569, Stuttgart, Germany
Sebastian Klenk, Dennis Thom & Gunther Heidemann

Editor information

Editors and Affiliations

Institut für Bildverarbeitung und angewandte Informatik, Körnerstr. 10, 04107, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

Klenk, S., Thom, D., Heidemann, G. (2009). The Normalized Compression Distance as a Distance Measure in Entity Identification. In: Perner, P. (eds) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2009. Lecture Notes in Computer Science, vol 5633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03067-3_26

DOI: https://doi.org/10.1007/978-3-642-03067-3_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03066-6
Online ISBN: 978-3-642-03067-3
eBook Packages: Computer Science (R0)
