Information Integration

Liu, Bing

doi:10.1007/978-3-642-19460-3_10

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

In Chap. 9, we studied data extraction from Web pages, where the extracted data is put in tables. For an application, however, it is often not sufficient to extract data from only a single site. Instead, data from a large number of sites is gathered in order to provide value-added services. In such cases, extraction is only part of the story. The other part is the integration of the extracted data to produce a consistent and coherent database, which is needed because different sites typically use different data formats. Intuitively, integration means matching columns in different data tables that contain the same type of information (e.g., product names) and matching values that are semantically identical but represented differently on different Web sites (e.g., “Coke” and “Coca Cola”). Unfortunately, little integration research has been done so far in this specific context. Much of the Web information integration research has focused on the integration of Web query interfaces, and this chapter devotes several sections to it. However, many of the ideas developed there also apply to the integration of extracted data because the problems are similar.
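
The two matching tasks named in the abstract can be illustrated with a small Python sketch. It is not the chapter's method: the two site tables, the `CANONICAL` synonym table, the Jaccard overlap measure, and the 0.5 threshold are all assumptions chosen for illustration. Values are first normalized so that "Coke" and "Coca Cola" become identical, and columns are then paired when their normalized value sets overlap enough.

```python
from itertools import product

# Hypothetical synonym table mapping variant spellings to one canonical value.
CANONICAL = {"coke": "coca cola", "coca-cola": "coca cola"}


def normalize(value):
    """Lowercase, trim, and map known variants to a canonical form."""
    v = value.strip().lower()
    return CANONICAL.get(v, v)


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_columns(table_a, table_b, threshold=0.5):
    """Pair columns whose normalized value sets overlap by at least `threshold`."""
    matches = []
    for (name_a, vals_a), (name_b, vals_b) in product(table_a.items(), table_b.items()):
        score = jaccard({normalize(v) for v in vals_a}, {normalize(v) for v in vals_b})
        if score >= threshold:
            matches.append((name_a, name_b, score))
    return matches


# Illustrative tables, as if extracted from two different sites (made-up data).
site_1 = {"product": ["Coke", "Pepsi", "Sprite"], "price": ["1.50", "1.40", "1.30"]}
site_2 = {"item name": ["Coca Cola", "Pepsi", "Fanta"], "cost": ["1.50", "1.45", "1.35"]}

matches = match_columns(site_1, site_2)
print(matches)
```

With this made-up data only the `product` and `item name` columns are paired, because the price columns share too few values. A practical matcher would likely combine such instance-level evidence with other signals such as column labels and data types.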

Author information

Authors and Affiliations

Department of Computer Science, University of Illinois, Chicago, 851 S. Morgan St., Chicago, IL, 60607-7053, USA
Bing Liu

Corresponding author

Correspondence to Bing Liu.

About this chapter

Cite this chapter

Liu, B. (2011). Information Integration. In: Web Data Mining. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19460-3_10

DOI: https://doi.org/10.1007/978-3-642-19460-3_10
Published: 15 April 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19459-7
Online ISBN: 978-3-642-19460-3
eBook Packages: Computer Science, Computer Science (R0)
