Instance Discovery and Schema Matching with Applications to Biological Deep Web Data Integration

  • Conference paper
Data Integration in the Life Sciences (DILS 2010)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6254)

Abstract

This paper presents data mining-based techniques for enabling data integration across deep web data sources. We target query processing across inter-dependent data sources. Thus, besides input-input and output-output matching of attributes, we also need to consider input-output matching. We develop data mining techniques for discovering the instances for querying deep web data sources from the information provided by the query interfaces themselves, as well as from the obtained output pages of the related data sources, by query probing using dynamically identified input instances. Then, using a hierarchical representation of schemas and by applying clustering techniques, we are able to generate schema matches. We show the effectiveness of our technique while integrating 24 query interfaces.
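
As a rough illustration of the instance-based clustering idea described in the abstract, the following Python sketch groups attributes whose sample instances overlap, so that an input attribute of one source can end up matched with an output attribute of another. It is not the authors' algorithm: the attribute names, the sample values, the Jaccard measure, the single-link merge rule, and the 0.3 threshold are all assumptions made for illustration, and the hierarchical schema representation mentioned above is left out.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of instance values."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_attributes(instances, threshold=0.3):
    """Single-link agglomerative clustering of attributes.

    instances: dict mapping attribute name -> set of sample values.
    Two clusters are merged when any pair of their attributes has
    Jaccard similarity >= threshold (an assumed cut-off).
    """
    clusters = [{name} for name in instances]
    merged = True
    while merged:
        merged = False
        for c1, c2 in combinations(clusters, 2):
            if any(
                jaccard(instances[a], instances[b]) >= threshold
                for a in c1
                for b in c2
            ):
                clusters.remove(c1)
                clusters.remove(c2)
                clusters.append(c1 | c2)
                merged = True
                break  # restart the scan with the updated cluster list
    return clusters


# Hypothetical attributes and instances, as might be collected from
# query interfaces (inputs) and from probed result pages (outputs).
instances = {
    "source_a.input.gene": {"BRCA1", "TP53", "EGFR"},
    "source_b.output.symbol": {"BRCA1", "TP53", "KRAS"},
    "source_a.output.snp_id": {"rs123", "rs456"},
    "source_c.input.rsid": {"rs123", "rs789"},
}

matches = cluster_attributes(instances)
for group in matches:
    print(sorted(group))
```

In this toy input, the gene-name attributes fall into one cluster and the SNP-identifier attributes into another, giving two input-output matches across sources. A real system would need far larger instance samples and a more robust similarity measure.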

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, T., Wang, F., Agrawal, G. (2010). Instance Discovery and Schema Matching with Applications to Biological Deep Web Data Integration. In: Lambrix, P., Kemp, G. (eds) Data Integration in the Life Sciences. DILS 2010. Lecture Notes in Computer Science, vol 6254. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15120-0_12

  • DOI: https://doi.org/10.1007/978-3-642-15120-0_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15119-4

  • Online ISBN: 978-3-642-15120-0

  • eBook Packages: Computer Science, Computer Science (R0)
