On the Automatic Extraction of Data from the Hidden Web

  • Conference paper
Conceptual Modeling for New Information Systems Technologies (ER 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2465)

Abstract

An increasing amount of Web data is accessible only by filling out HTML forms to query an underlying data source. While this is most welcome from a user perspective (queries are easy and precise) and from a data management perspective (static pages need not be maintained; databases can be accessed directly), automated agents have greater difficulty accessing data behind forms. In this paper we present a method for automatically filling in forms to retrieve the associated dynamically generated pages. Using our approach, automated agents can begin to systematically access portions of the “hidden Web.”
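
To make the idea of automatic form filling concrete, here is a minimal Python sketch. It is not the authors' method, which the abstract does not detail. It assumes a simple starting point: read the first form on a page, keep every field at its default value, and build the request an agent could submit. The names `FormParser` and `build_default_request` and the example URL are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collects action, method, and default field values of the first form."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False
        self._select = None  # name of the <select> currently being read

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = a.get("action") or ""
            self.method = (a.get("method") or "get").lower()
        elif not self._in_form:
            return
        elif tag == "input" and a.get("name"):
            kind = (a.get("type") or "text").lower()
            if kind in ("submit", "reset", "button", "image", "file"):
                return
            if kind in ("checkbox", "radio"):
                if "checked" in a:  # only pre-checked boxes are sent
                    self.fields[a["name"]] = a.get("value") or "on"
                return
            self.fields[a["name"]] = a.get("value") or ""
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
        elif tag == "option" and self._select:
            # Simplification: the first option is the default unless a
            # later one is marked selected; option text is ignored.
            if self._select not in self.fields or "selected" in a:
                self.fields[self._select] = a.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_default_request(page_url, html):
    """Return (method, url, body) for submitting the form with its defaults."""
    parser = FormParser()
    parser.feed(html)
    target = urljoin(page_url, parser.action)
    query = urlencode(parser.fields)
    if parser.method == "post":
        return "post", target, query
    return "get", target + "?" + query, None
```

An agent would then fetch the returned URL (or post the body) and pass the resulting page on for data extraction. Choosing non-default values for each field is the harder part of the problem and is left out of this sketch.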

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liddle, S.W., Yau, S.H., Embley, D.W. (2002). On the Automatic Extraction of Data from the Hidden Web. In: Arisawa, H., Kambayashi, Y., Kumar, V., Mayr, H.C., Hunt, I. (eds) Conceptual Modeling for New Information Systems Technologies. ER 2001. Lecture Notes in Computer Science, vol 2465. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46140-X_17

  • DOI: https://doi.org/10.1007/3-540-46140-X_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44122-9

  • Online ISBN: 978-3-540-46140-1

  • eBook Packages: Springer Book Archive
