Rebuilding the World from Views

  • Conference paper
Web-Age Information Management (WAIM 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6184)

Abstract

With the ever-increasing growth of the internet, more and more data sets are being made available. Most of this data has its origin in the real world, often describing the same objects or events from different viewpoints. One can thus consider data sets obtained from different sources as different (and possibly inconsistent) views of our world, and it makes sense to try to integrate them in some form, e.g. to answer questions which involve data from multiple sources. While data integration is an old and well-investigated subject, the nature of the data sets to be integrated is changing. They are increasing in volume as well as complexity and are often undocumented; relationships between data sets are fuzzier, and representations of the same real-world object differ. To address these challenges, new methods for rapid, semi-automatic, loose and virtual integration, exploration and querying of large families of data sets must be developed.

In an ongoing project we are investigating a framework for sampling and matching data sets in an efficient manner. In particular, we consider the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes [1]. Our focus is on identifying attribute pairs whose value sets overlap, a pre-condition for typical joins over such attributes. We deal with the issue of differing representations of objects, i.e., ‘dirty’ data, by employing new similarity measures between sets of strings, which consider not only set-based similarity but also similarity between string instances. To make the measures effective, especially in light of data sets being large and distributed, we developed efficient algorithms for distributed sample creation and similarity computation. Central to this is that sampling is synchronized. For clean data this means that the same values are sampled for each set, if present [2,3]. For dirty data one must ensure that similar values are sampled for each set, if present, and we manage to do so in a probabilistic manner.
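To make the idea of synchronized sampling on clean data concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a shared deterministic hash function and a common threshold, so that each site independently keeps exactly those values whose hash falls below the threshold; a value present in both attributes is then either sampled in both or in neither. The names `hash_value`, `synchronized_sample` and `estimate_overlap` are hypothetical, and the Jaccard-style estimate is only one possible overlap measure. The probabilistic treatment of dirty data described above is not covered here.

```python
import hashlib


def hash_value(value: str) -> int:
    """Map a string to a 64-bit integer with a hash shared by all sites."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def synchronized_sample(values, fraction: float) -> set:
    """Keep every distinct value whose hash lies below a shared threshold.

    Because the decision depends only on the value itself, two sites that
    hold the same value make the same keep/drop decision without
    communicating (assumption: both use the same hash and fraction).
    """
    limit = int(fraction * 2**64)
    return {v for v in set(values) if hash_value(v) < limit}


def estimate_overlap(sample_a: set, sample_b: set) -> float:
    """Estimate Jaccard similarity of the full value sets from the samples."""
    union = sample_a | sample_b
    if not union:
        return 0.0
    return len(sample_a & sample_b) / len(union)


if __name__ == "__main__":
    # Two hypothetical string-valued attributes from different databases.
    customers_city = [f"city{i}" for i in range(0, 2000)]
    suppliers_town = [f"city{i}" for i in range(1000, 3000)]

    sample_a = synchronized_sample(customers_city, 0.1)
    sample_b = synchronized_sample(suppliers_town, 0.1)
    print("sample sizes:", len(sample_a), len(sample_b))
    print("estimated overlap:", estimate_overlap(sample_a, sample_b))
```

The design point is that the intersection of the two samples equals the sample of the intersection, which is what allows overlap to be estimated from small samples created independently at each source.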

The next step of our research is to extend such a sampling and matching approach to multiple attributes and semi-structured data, and to construct search and query systems which make direct use of the matches discovered.

References

  1. Köhler, H., Zhou, X., Sadiq, S., Shu, Y., Taylor, K.: Sampling dirty data for matching attributes. In: SIGMOD (2010)

  2. Broder, A.: On the resemblance and containment of documents. In: SEQUENCES: Proceedings of the Compression and Complexity of Sequences, p. 21 (1997)

  3. Dasu, T., Johnson, T., Muthukrishnan, S., Shkapenyuk, V.: Mining database structure; or, how to build a data quality browser. In: SIGMOD, pp. 240–251 (2002)

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhou, X., Köhler, H. (2010). Rebuilding the World from Views. In: Chen, L., Tang, C., Yang, J., Gao, Y. (eds) Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6184. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14246-8_2

  • DOI: https://doi.org/10.1007/978-3-642-14246-8_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14245-1

  • Online ISBN: 978-3-642-14246-8

  • eBook Packages: Computer Science, Computer Science (R0)
