Duplicate Identification in Deep Web Data Integration

Liu, Wei; Meng, Xiaofeng; Yang, Jianwu; Xiao, Jianguo

doi:10.1007/978-3-642-14246-8_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6184)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

Duplicate identification is a critical step in deep web data integration, and this task generally has to be performed over multiple web databases. However, a matcher customized for one pair of web databases often does not work well for another pair, because the databases present records differently and use different schemas. It is not practical to build and maintain \(C^{2}_{n}\) matchers for n web databases. In this paper, we aim to build one universal matcher over multiple web databases in one domain. According to our observation, the similarity on an attribute depends on the similarities on some other attributes, which existing approaches ignore. Inspired by this, we propose a comprehensive solution to the duplicate identification problem over multiple web databases. Extensive experiments over real web databases in three domains show that the proposed solution is an effective way to address the duplicate identification problem over multiple web databases.
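
To make the scaling argument in the abstract concrete, the short sketch below counts how many customized matchers would be needed if every pair of n web databases got its own, that is, n choose 2 = n(n-1)/2, against the single universal matcher the paper aims for. The function names are hypothetical and chosen only for illustration; nothing here reproduces the paper's matching method.

```python
from math import comb


def pairwise_matcher_count(n: int) -> int:
    """Matchers needed when each pair of n web databases has its own matcher.

    This is the binomial coefficient "n choose 2" = n * (n - 1) / 2.
    """
    return comb(n, 2)


def universal_matcher_count(n: int) -> int:
    """Matchers needed when one matcher serves every pair in the domain.

    Assumption for illustration: no matcher is needed with fewer than
    two databases, since there is nothing to match across.
    """
    return 1 if n >= 2 else 0


if __name__ == "__main__":
    # Print the two counts side by side for a few database counts.
    for n in (2, 5, 10, 50):
        print(n, pairwise_matcher_count(n), universal_matcher_count(n))
```

The pairwise count grows quadratically with the number of databases, while the universal count stays constant, which is the maintenance cost the abstract calls impractical.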

Author information

Authors and Affiliations

Institute of Computer Science & Technology, Peking University, Beijing, China
Wei Liu, Jianwu Yang & Jianguo Xiao
School of Information, Renmin University of China, Beijing, China
Xiaofeng Meng

Editor information

Editors and Affiliations

Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
Lei Chen
Computer Department, Sichuan University, 610064, Chengdu, China
Changjie Tang
Department of Computer Science, Duke University, Box 90129, NC 27708-0129, Durham, USA
Jun Yang
College of Computer Science, Zhejiang University, 388 Yuhangtang Road, 310058, Hangzhou, China
Yunjun Gao

About this paper

Cite this paper

Liu, W., Meng, X., Yang, J., Xiao, J. (2010). Duplicate Identification in Deep Web Data Integration. In: Chen, L., Tang, C., Yang, J., Gao, Y. (eds) Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6184. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14246-8_4

DOI: https://doi.org/10.1007/978-3-642-14246-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14245-1
Online ISBN: 978-3-642-14246-8
eBook Packages: Computer Science, Computer Science (R0)
