Information Integration

Web Data Mining

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

In Chap. 9, we studied data extraction from Web pages, where the extracted data is put in tables. For an application, however, it is often not sufficient to extract data from a single site. Instead, data is gathered from a large number of sites in order to provide value-added services. In such cases, extraction is only part of the story. The other part is integrating the extracted data into a consistent and coherent database, which is needed because different sites typically use different data formats. Intuitively, integration means matching columns in different data tables that contain the same type of information (e.g., product names) and matching values that are semantically identical but represented differently on different Web sites (e.g., “Coke” and “Coca Cola”). Unfortunately, only limited research has addressed integration in this specific context. Much of the Web information integration research has instead focused on integrating Web query interfaces, and several sections of this chapter are devoted to that topic. Many of the ideas developed there also apply to integrating extracted data, because the two problems are similar.
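
To make the two matching tasks concrete, the following is a minimal Python sketch, not the chapter's own method. It assumes each extracted table is a dictionary from column name to a list of string values, resolves value variants through a small hand-written synonym table (`CANONICAL`, hypothetical), and pairs columns whose normalized value sets overlap, using Jaccard similarity as one possible overlap measure. The function names, the sample data, and the 0.5 threshold are all illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical synonym table: maps a surface form to a canonical value.
CANONICAL = {"coke": "coca cola", "coca-cola": "coca cola"}


def normalize(value: str) -> str:
    """Lowercase, collapse whitespace, then apply the synonym table."""
    key = " ".join(value.lower().split())
    return CANONICAL.get(key, key)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_columns(
    t1: Dict[str, List[str]],
    t2: Dict[str, List[str]],
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """For each column of t1, pick the t2 column with the highest
    overlap of normalized values, if it reaches the threshold."""
    matches = []
    for name1, vals1 in t1.items():
        set1 = {normalize(v) for v in vals1}
        best_name, best_score = None, 0.0
        for name2, vals2 in t2.items():
            score = jaccard(set1, {normalize(v) for v in vals2})
            if score > best_score:
                best_name, best_score = name2, score
        if best_name is not None and best_score >= threshold:
            matches.append((name1, best_name, best_score))
    return matches


# Illustrative tables as they might come from two different sites.
site_a = {"product": ["Coke", "Pepsi", "Sprite"],
          "price": ["1.50", "1.40", "1.30"]}
site_b = {"item name": ["Coca Cola", "Pepsi", "Fanta"],
          "cost": ["1.50", "1.45", "1.30"]}

for col_a, col_b, score in match_columns(site_a, site_b):
    print(f"{col_a} <-> {col_b} (overlap {score:.2f})")
```

In this sketch the column names play no role: "product" and "item name" are paired only because their values overlap once “Coke” and “Coca Cola” are mapped to the same canonical form. A practical system would also need evidence from labels, data types, and domain knowledge, since value overlap alone can be misleading.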


Author information

Correspondence to Bing Liu.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Liu, B. (2011). Information Integration. In: Web Data Mining. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19460-3_10

  • DOI: https://doi.org/10.1007/978-3-642-19460-3_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19459-7

  • Online ISBN: 978-3-642-19460-3

  • eBook Packages: Computer Science (R0)