Rebuilding the World from Views

Zhou, Xiaofang; Köhler, Henning

doi:10.1007/978-3-642-14246-8_2

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6184)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

As the internet grows, more and more data sets are being made available. Most of this data originates in the real world and often describes the same objects or events from different viewpoints. Data sets obtained from different sources can thus be regarded as different (and possibly inconsistent) views of our world, and it makes sense to integrate them in some form, e.g. to answer questions that involve data from multiple sources. While data integration is an old and well-investigated subject, the nature of the data sets to be integrated is changing. They are growing in volume and complexity and are often undocumented; relationships between data sets are fuzzier, and representations of the same real-world object differ. To address these challenges, new methods must be developed for rapid, semi-automatic, loose and virtual integration, exploration and querying of large families of data sets.

In an ongoing project we are investigating a framework for sampling and matching data sets efficiently. In particular, we consider the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes [1]. Our focus is on identifying attribute pairs whose value sets overlap, a precondition for typical joins over such attributes. We deal with differing representations of the same object, i.e., ‘dirty’ data, by employing new similarity measures between sets of strings, which consider not only set-based similarity but also similarity between individual strings. To make the measures effective, especially given that data sets are large and distributed, we developed efficient algorithms for distributed sample creation and similarity computation. Central to this is that sampling is synchronized. For clean data this means that the same values are sampled for each set, if present [2,3]. For dirty data one must ensure that similar values are sampled for each set, if present, and we manage to do so in a probabilistic manner.
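To make the idea of synchronized sampling concrete, the following is a minimal sketch of the clean-data case only, assuming a hash-based scheme in the spirit of [2]: a shared hash function decides whether a value is kept, so the decision depends on the value alone and not on the set it came from. The function names, the choice of SHA-1, and the sampling rate are illustrative assumptions and are not taken from the paper; the probabilistic treatment of dirty data described in [1] is not reproduced here.

```python
import hashlib


def stable_hash(value: str) -> int:
    """Map a string to a 64-bit integer, identically on every machine."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def synchronized_sample(values, rate_denominator: int = 4) -> set:
    """Keep the values whose hash falls into one fixed residue class.

    Hypothetical helper: roughly 1 in `rate_denominator` distinct values
    is kept. Because the decision depends only on the value, two sites
    sampling independently keep the same value whenever both hold it.
    """
    return {v for v in set(values) if stable_hash(v) % rate_denominator == 0}


def estimated_containment(sample_a: set, sample_b: set) -> float:
    """Estimate |A ∩ B| / |A| from synchronized samples of A and B."""
    if not sample_a:
        return 0.0
    return len(sample_a & sample_b) / len(sample_a)


if __name__ == "__main__":
    # Two attribute value sets held at different sites (made-up data).
    column_a = [f"city_{i}" for i in range(200)]
    column_b = [f"city_{i}" for i in range(100, 400)]

    sample_a = synchronized_sample(column_a)
    sample_b = synchronized_sample(column_b)
    print(len(sample_a), len(sample_b))
    print(estimated_containment(sample_a, sample_b))
```

Under these assumptions, the intersection of the two samples equals the sample of the intersection, which is why overlap can be estimated from the samples without shipping the full value sets between sites. With independent random sampling, two samples of overlapping sets might share few or no values.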

The next step of our research is to extend such a sampling and matching approach to multiple attributes and semi-structured data, and to construct search and query systems which make direct use of the matches discovered.

References

1. Köhler, H., Zhou, X., Sadiq, S., Shu, Y., Taylor, K.: Sampling dirty data for matching attributes. In: SIGMOD (2010)
2. Broder, A.: On the resemblance and containment of documents. In: SEQUENCES: Proceedings of the Compression and Complexity of Sequences, p. 21 (1997)
3. Dasu, T., Johnson, T., Muthukrishnan, S., Shkapenyuk, V.: Mining database structure; or, how to build a data quality browser. In: SIGMOD, pp. 240–251 (2002)

Author information

Authors and Affiliations

School of Information Technology and Electrical Engineering, The University of Queensland, Australia
Xiaofang Zhou & Henning Köhler

Editor information

Editors and Affiliations

Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
Lei Chen
Computer Department, Sichuan University, 610064, Chengdu, China
Changjie Tang
Department of Computer Science, Duke University, Box 90129, NC 27708-0129, Durham, USA
Jun Yang
College of Computer Science, Zhejiang University, 388 Yuhangtang Road, 310058, Hangzhou, China
Yunjun Gao

About this paper

Cite this paper

Zhou, X., Köhler, H. (2010). Rebuilding the World from Views. In: Chen, L., Tang, C., Yang, J., Gao, Y. (eds) Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6184. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14246-8_2

DOI: https://doi.org/10.1007/978-3-642-14246-8_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14245-1
Online ISBN: 978-3-642-14246-8
eBook Packages: Computer Science, Computer Science (R0)
