Dynamic Similarity for Fields with NULL Values

Zhao, Li; Yuan, Sung Sam; Yang, Qi Xiao; Peng, Sun

doi:10.1007/3-540-46145-0_16

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2454)

Included in the following conference series:

International Conference on Data Warehousing and Knowledge Discovery

Abstract

One of the most important tasks in data cleansing is deduplicating records, which requires comparing records to determine their equivalence. However, existing comparison methods, such as Record Similarity and Equational Theory, implicitly assume that the values in all fields are known, and they treat NULL values as empty strings, which causes correct duplicate records to be missed. In this paper, we solve this problem by proposing a simple yet efficient method, Dynamic Similarity, which dynamically adjusts the similarity for fields with NULL values. Performance results on real and synthetic datasets show that the Dynamic Similarity method finds more correct duplicate records than Record Similarity without introducing more false positives. Furthermore, the percentage of correct duplicate records found by Dynamic Similarity but not by Record Similarity increases as the number of fields with NULL values increases.
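The abstract does not give the formula behind Dynamic Similarity, so the following is only a minimal sketch of the problem it describes, not the authors' method. It assumes record similarity is a weighted average of per-field similarities, uses Python's `difflib.SequenceMatcher` as a stand-in field measure, and models one possible adjustment: leaving out any field where a value is NULL and renormalising the remaining weights. The function names and sample records are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence

Record = Sequence[Optional[str]]


def field_similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; not the measure used in the paper."""
    return SequenceMatcher(None, a, b).ratio()


def similarity_null_as_empty(r1: Record, r2: Record, weights: Sequence[float]) -> float:
    """Baseline behaviour described in the abstract: NULL is compared as ''."""
    score = 0.0
    for a, b, w in zip(r1, r2, weights):
        score += w * field_similarity(a or "", b or "")
    return score / sum(weights)


def similarity_skip_null(r1: Record, r2: Record, weights: Sequence[float]) -> float:
    """Assumed adjustment: ignore fields with a NULL value and renormalise weights."""
    score = 0.0
    total = 0.0
    for a, b, w in zip(r1, r2, weights):
        if a is None or b is None:
            continue  # unknown value: contributes no evidence either way
        score += w * field_similarity(a, b)
        total += w
    return score / total if total > 0 else 0.0


# Hypothetical records: same person, phone number missing in the first one.
rec_a = ("John Smith", "12 Main St", None)
rec_b = ("John Smith", "12 Main St", "555-0100")
weights = (1.0, 1.0, 1.0)

baseline = similarity_null_as_empty(rec_a, rec_b, weights)
adjusted = similarity_skip_null(rec_a, rec_b, weights)
```

With a duplicate threshold between the two scores, the baseline would miss this pair because the missing phone number counts as a mismatch, while the adjusted score depends only on the fields that are known in both records.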

Author information

Authors and Affiliations

Department of Computer Science, National University of Singapore, 3 Science Drive 2, 117543, Singapore
Li Zhao, Sung Sam Yuan & Sun Peng
Institute of High Performance Computing, 89B Science Park Drive, 118261, Singapore
Qi Xiao Yang

Editor information

Editors and Affiliations

Graduate School of Informatics, Kyoto University, Yoshida-Honmachi, Sakyo-ku, 606-8501, Kyoto, Japan
Yahiko Kambayashi
Institute for Computer Science and Business Informatics, University of Vienna, Liebiggasse 4, 1010, Vienna, Austria
Werner Winiwarter
Center for Spatial Information Science (CSIS), University of Tokyo, 4-6-1, Komaba, Meguro-ku, 153-8904, Tokyo, Japan
Masatoshi Arikawa

About this paper

Cite this paper

Zhao, L., Yuan, S.S., Yang, Q.X., Peng, S. (2002). Dynamic Similarity for Fields with NULL Values. In: Kambayashi, Y., Winiwarter, W., Arikawa, M. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2002. Lecture Notes in Computer Science, vol 2454. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46145-0_16

DOI: https://doi.org/10.1007/3-540-46145-0_16
Published: 02 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44123-6
Online ISBN: 978-3-540-46145-6
eBook Packages: Springer Book Archive
