Abstract
Identifying identical entities across heterogeneous data sources still involves a large amount of manual processing, mainly because different sources use different data representations in varying semantic contexts. Until now, entity identification has required either the (often manual) unification of the different representations or the effort of programming tools with specialized interfaces for each representation type. For large and sparse databases, which are common, for example, in medical data, the manual approach becomes infeasible.
We have developed a widely applicable compression-based approach that does not rely on structural or semantic unity. The results we have obtained are promising in both recognition precision and performance.
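The abstract does not spell out the measure itself, but the chapter title names the normalized compression distance (NCD). As a rough illustration only, the sketch below computes the standard NCD, (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of zlib as compressor, the function names, and the sample records are assumptions made for this example and are not taken from the chapter.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both inputs
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical records: a and b describe the same person in different layouts,
# c describes someone else. None of this data comes from the chapter.
record_a = b"Doe, Jane; 1980-01-01; Springfield; 742 Evergreen Terrace; cardiology"
record_b = b"Jane Doe|742 Evergreen Terrace|Springfield|01.01.1980|cardiology ward"
record_c = b"Smith, Robert; 1975-06-12; Riverton; 19 Harbour Lane; orthopaedics"

same_entity = ncd(record_a, record_b)
different_entity = ncd(record_a, record_c)
print(f"same entity: {same_entity:.3f}, different entity: {different_entity:.3f}")
```

A lower value suggests that the two records share more content. Any decision threshold would have to be tuned on the data at hand, and short records leave a general-purpose compressor little to work with, so the absolute values here should not be read as meaningful.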
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Klenk, S., Thom, D., Heidemann, G. (2009). The Normalized Compression Distance as a Distance Measure in Entity Identification. In: Perner, P. (ed.) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2009. Lecture Notes in Computer Science, vol 5633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03067-3_26
DOI: https://doi.org/10.1007/978-3-642-03067-3_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03066-6
Online ISBN: 978-3-642-03067-3
eBook Packages: Computer Science, Computer Science (R0)