The Normalized Compression Distance as a Distance Measure in Entity Identification

  • Conference paper
Advances in Data Mining. Applications and Theoretical Aspects (ICDM 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5633)

Abstract

The identification of identical entities across heterogeneous data sources still involves a large amount of manual processing, mainly because different sources use different data representations in varying semantic contexts. Up to now, entity identification has required either the (often manual) unification of different representations or the effort of programming tools with specialized interfaces for each representation type. However, for large and sparse databases, which are common, e.g., for medical data, the manual approach becomes infeasible.

We have developed a widely applicable compression-based approach that does not rely on structural or semantic unity. The results we have obtained are promising in both recognition precision and performance.
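To make the compression-based idea concrete, the following is a minimal sketch of the normalized compression distance named in the title, using its usual definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. The abstract does not state which compressor or record layout the authors used, so the choice of zlib, the function names, and the sample records here are all assumptions made for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size. Values near 0 suggest similar
    content; values near 1 suggest unrelated content.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical records in two different representations (not from the paper).
rec_a = b"Mueller, Hans; 1970-03-12; Stuttgart"
rec_b = b"name=Hans Mueller|dob=12.03.1970|city=Stuttgart"
rec_c = b"Smith, Jane; 1985-11-02; Boston"

if __name__ == "__main__":
    print("same record:        ", ncd(rec_a, rec_a))
    print("other representation:", ncd(rec_a, rec_b))
    print("different entity:    ", ncd(rec_a, rec_c))
```

Because the measure only needs byte strings, no schema alignment is required before comparing records. On very short inputs such as these, the fixed overhead of a general-purpose compressor dominates, so the distances are only a rough signal; a decision threshold would have to be tuned on real data.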

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Klenk, S., Thom, D., Heidemann, G. (2009). The Normalized Compression Distance as a Distance Measure in Entity Identification. In: Perner, P. (ed.) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2009. Lecture Notes in Computer Science, vol 5633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03067-3_26

  • DOI: https://doi.org/10.1007/978-3-642-03067-3_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03066-6

  • Online ISBN: 978-3-642-03067-3

  • eBook Packages: Computer Science, Computer Science (R0)
