On Memory and I/O Efficient Duplication Detection for Multiple Self-clean Data Sources

Zhang, Ji; Shu, Yanfeng; Wang, Hua

doi:10.1007/978-3-642-14589-6_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6193)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

In this paper, we propose efficient algorithms for detecting duplicates across multiple data sources that are themselves duplicate-free. In developing these algorithms, we consider the full range of cases that arise from the workload of the data sources to be cleaned and the available memory. The algorithms are memory- and I/O-efficient: they reduce the number of pairwise record comparisons and minimize the total page access cost of the cleaning process. Experimental evaluation shows that the proposed algorithms are efficient and outperform the sorted neighborhood method (SNM) and random access methods.
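The saving that self-clean sources make possible can be illustrated with a minimal Python sketch. It is not the paper's algorithm (the abstract does not describe it in detail) and it ignores memory and page-access costs entirely; it only shows that, when each source is already duplicate-free, record pairs from the same source never need to be compared. The helper names (`cross_source_pairs`, `count_all_pairs`, `is_duplicate`, `find_duplicates`) and the toy similarity test are assumptions made for illustration.

```python
from itertools import combinations, product


def cross_source_pairs(sources):
    """Yield record pairs drawn from two *different* sources only.

    Pairs inside one source are skipped because each source is
    assumed to be duplicate-free already (self-clean).
    """
    for left, right in combinations(sources, 2):
        yield from product(left, right)


def count_all_pairs(sources):
    """Number of comparisons a naive all-pairs scan would make."""
    n = sum(len(s) for s in sources)
    return n * (n - 1) // 2


def is_duplicate(a, b):
    # Hypothetical similarity test: equality after lower-casing and
    # collapsing whitespace. A real system would use a fuzzy matcher.
    return " ".join(a.lower().split()) == " ".join(b.lower().split())


def find_duplicates(sources):
    """Return the cross-source pairs judged to be duplicates."""
    return [(a, b) for a, b in cross_source_pairs(sources) if is_duplicate(a, b)]


# Toy data: three sources, each free of internal duplicates.
sources = [
    ["Ann Lee", "Bob Ray"],
    ["ann  lee", "Cy Dow", "Di Fox"],
    ["BOB RAY"],
]
cross_count = sum(1 for _ in cross_source_pairs(sources))
all_count = count_all_pairs(sources)
duplicates = find_duplicates(sources)
```

On this toy input the cross-source scan makes fewer comparisons than the naive all-pairs scan while still finding every duplicate, since no duplicate can lie within a single source.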

Author information

Authors and Affiliations

Department of Mathematics and Computing, The University of Southern Queensland, Australia
Ji Zhang & Hua Wang
CSIRO ICT Centre, Hobart, Australia
Yanfeng Shu

Editor information

Editors and Affiliations

Graduate School of Informatics, Kyoto University, Yoshida Honmachi, Sakyo, 606-8501, Kyoto, Japan
Masatoshi Yoshikawa
Information School, Renmin University of China, 100872, Beijing, China
Xiaofeng Meng
Graduate School of Engineering, University of Hyogo, 2167 Shosha, Himeji, 671-2280, Hyogo, Japan
Takayuki Yumoto
Graduate School of Informatics, Kyoto University, Yoshidahonmachi, Sakyo, 606-8501, Kyoto, Japan
Qiang Ma
Institute of HCI and Media Integration, Tsinghua University, 100084, Beijing, China
Lifeng Sun
Department of Information Science, Ochanomizu University, 2-1-1, Otsuka, Bunkyo-ku, 112-8610, Tokyo, Japan
Chiemi Watanabe

Cite this paper

Zhang, J., Shu, Y., Wang, H. (2010). On Memory and I/O Efficient Duplication Detection for Multiple Self-clean Data Sources. In: Yoshikawa, M., Meng, X., Yumoto, T., Ma, Q., Sun, L., Watanabe, C. (eds) Database Systems for Advanced Applications. DASFAA 2010. Lecture Notes in Computer Science, vol 6193. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14589-6_14

DOI: https://doi.org/10.1007/978-3-642-14589-6_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14588-9
Online ISBN: 978-3-642-14589-6
eBook Packages: Computer Science, Computer Science (R0)