On Memory and I/O Efficient Duplication Detection for Multiple Self-clean Data Sources

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6193)

Abstract

In this paper, we propose efficient algorithms for detecting duplicates across multiple data sources that are each duplicate-free (self-clean). In designing these algorithms, we take full account of the cases that arise from the workload of the data sources to be cleaned and the amount of available memory. The algorithms are memory and I/O efficient: they reduce the number of pairwise record comparisons and minimize the total page access cost of the cleaning process. Experimental evaluation shows that the proposed algorithms are efficient and outperform the Sorted Neighborhood Method (SNM) and random access methods.
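To illustrate why self-clean sources reduce the number of pairwise comparisons, the following is a minimal sketch and not the algorithms proposed in the paper. It assumes only that records within one source never duplicate each other, so only cross-source pairs are candidates. The function names and the example source sizes are hypothetical.

```python
from itertools import combinations
from math import comb


def naive_pair_count(source_sizes):
    """Pairs compared if all records are pooled and every pair is checked."""
    return comb(sum(source_sizes), 2)


def cross_source_pair_count(source_sizes):
    """Pairs compared if same-source pairs are skipped (sources are self-clean)."""
    return sum(a * b for a, b in combinations(source_sizes, 2))


def cross_source_pairs(sources):
    """Yield candidate record pairs, never pairing two records from one source."""
    for left, right in combinations(sources, 2):
        for r in left:
            for s in right:
                yield r, s


if __name__ == "__main__":
    sizes = [1000, 2000, 3000]  # hypothetical record counts per source
    print(naive_pair_count(sizes))         # 17997000
    print(cross_source_pair_count(sizes))  # 11000000
```

For these example sizes, skipping same-source pairs removes about 39% of the comparisons before any further filtering. The sketch ignores memory limits and page access order, which the paper treats as part of the problem.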

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, J., Shu, Y., Wang, H. (2010). On Memory and I/O Efficient Duplication Detection for Multiple Self-clean Data Sources. In: Yoshikawa, M., Meng, X., Yumoto, T., Ma, Q., Sun, L., Watanabe, C. (eds) Database Systems for Advanced Applications. DASFAA 2010. Lecture Notes in Computer Science, vol 6193. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14589-6_14

  • DOI: https://doi.org/10.1007/978-3-642-14589-6_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14588-9

  • Online ISBN: 978-3-642-14589-6

  • eBook Packages: Computer Science, Computer Science (R0)
