Site-Wide Wrapper Induction for Life Science Deep Web Databases

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5647)

Abstract

We present a novel approach to automatic information extraction from Deep Web Life Science databases using wrapper induction. Traditional wrapper induction techniques learn wrappers from examples drawn from a single class of Web pages, i.e., from Web pages that are all similar in structure and content. Traditional wrapper induction thus targets Web pages generated from a database with the same generation template as the one observed in the example set. However, Life Science Web sites typically contain structurally diverse Web pages from multiple classes, which makes the problem more challenging. Furthermore, we observed that such Life Science Web sites do not provide data alone; they also tend to provide schema information in the form of data labels, which gives further cues for solving the Web site wrapping task. Our solution to this novel challenge of Site-Wide wrapper induction consists of a sequence of steps: 1. classification of similar Web pages into classes, 2. discovery of these classes and 3. wrapper induction for each class. Our approach thus allows us to perform unsupervised information retrieval across an entire Web site. We test our algorithm against three real-world biochemical Deep Web sources and report our preliminary results, which are very promising.
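The steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes, purely for illustration, that pages are grouped by Jaccard similarity over their sets of root-to-element tag paths, and that any text appearing on every page of a class is a data label whose value is the text node that follows it. All function names, the threshold, and the sample pages are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PageScanner(HTMLParser):
    """Collects root-to-element tag paths and the text nodes of one page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Void elements never get a closing tag, so do not push them.
            self.paths.add("/".join(self.stack + [tag]))
            return
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def scan(html):
    """Return (set of tag paths, list of text nodes) for one page."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.paths, scanner.texts


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.8):
    """Greedy grouping: a page joins the first class whose representative
    tag-path set is similar enough, otherwise it starts a new class."""
    clusters = []  # list of (representative path set, member indices)
    for index, html in enumerate(pages):
        paths, _ = scan(html)
        for representative, members in clusters:
            if jaccard(representative, paths) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((paths, [index]))
    return [members for _, members in clusters]


def induce_labels(class_pages):
    """Assumption: text shared by every page of a class is a data label."""
    return set.intersection(*(set(scan(html)[1]) for html in class_pages))


def extract(html, labels):
    """Toy wrapper: pair each label with the next non-label text node."""
    _, texts = scan(html)
    record = {}
    for current, following in zip(texts, texts[1:]):
        if current in labels and following not in labels:
            record[current] = following
    return record


# Hypothetical sample site: two record pages and one list page.
pages = [
    "<html><body><table><tr><td>Name</td><td>water</td></tr>"
    "<tr><td>Formula</td><td>H2O</td></tr></table></body></html>",
    "<html><body><table><tr><td>Name</td><td>methane</td></tr>"
    "<tr><td>Formula</td><td>CH4</td></tr></table></body></html>",
    "<html><body><ul><li>water</li><li>methane</li></ul></body></html>",
]

classes = cluster_pages(pages)
record_pages = [pages[i] for i in classes[0]]
labels = induce_labels(record_pages)
records = [extract(html, labels) for html in record_pages]
```

One limitation of this simplification is that a value which happens to be identical on every page of a class would be mistaken for a label; a real system would need additional evidence to separate the two.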

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mir, S., Staab, S., Rojas, I. (2009). Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: Paton, N.W., Missier, P., Hedeler, C. (eds) Data Integration in the Life Sciences. DILS 2009. Lecture Notes in Computer Science, vol 5647. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02879-3_9

  • DOI: https://doi.org/10.1007/978-3-642-02879-3_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02878-6

  • Online ISBN: 978-3-642-02879-3

  • eBook Packages: Computer Science, Computer Science (R0)
