Abstract
In Chap. 9, we studied data extraction from Web pages, where the extracted data is put in tables. For an application, however, it is often not sufficient to extract data from a single site. Instead, data from a large number of sites is gathered in order to provide value-added services. In such cases, extraction is only part of the story. The other part is the integration of the extracted data to produce a consistent and coherent database, because different sites typically use different data formats. Intuitively, integration means matching columns in different data tables that contain the same type of information (e.g., product names) and matching values that are semantically identical but represented differently on different Web sites (e.g., “Coke” and “Coca Cola”). Unfortunately, limited integration research has been done so far in this specific context. Much of the Web information integration research has focused on the integration of Web query interfaces, and this chapter has several sections on that topic. However, many of the ideas developed there are also applicable to the integration of extracted data because the problems are similar.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Liu, B. (2011). Information Integration. In: Web Data Mining. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19460-3_10
DOI: https://doi.org/10.1007/978-3-642-19460-3_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19459-7
Online ISBN: 978-3-642-19460-3
eBook Packages: Computer Science (R0)