A Method for Web Information Extraction

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4976)

Abstract

The World Wide Web has become one of the most important information repositories. However, information in web pages follows no presentation standard and is rarely organized in a consistent format, so extracting appropriate and useful information from web pages is challenging. Many web extraction systems, called web wrappers, have been developed, both semi-automatic and fully automatic. In this paper, we investigate some existing techniques and then present our current work on web information extraction. In our design, we classify information patterns into static and non-static structures and use a different technique to extract the relevant information from each. In our implementation, patterns are represented as XSL files, and all extracted information is packaged into a machine-readable XML format.
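
The idea of applying a stored pattern to a page and packaging the matches as XML can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps the XSL pattern files for a small Python dictionary of ElementTree path expressions, it covers only a static structure, and it assumes the page has already been cleaned into well-formed XHTML. The page fragment, the pattern layout, and the names PAGE, PATTERN, and extract are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed XHTML fragment standing in for a cleaned web page.
PAGE = """<html><body>
<table id="books">
<tr><td>Web Data Mining</td><td>59.95</td></tr>
<tr><td>XML in Practice</td><td>39.50</td></tr>
</table>
</body></html>"""

# Hypothetical static pattern: where the records live and which field each column holds.
# In the paper this role is played by an XSL file; a dict is used here only for brevity.
PATTERN = {"record": ".//table[@id='books']/tr", "fields": ["title", "price"]}


def extract(page: str, pattern: dict) -> ET.Element:
    """Apply a static pattern to a page and package the matches as XML."""
    root = ET.fromstring(page)
    result = ET.Element("records")
    for row in root.findall(pattern["record"]):
        record = ET.SubElement(result, "record")
        # Pair each table cell with the field name declared in the pattern.
        for name, cell in zip(pattern["fields"], row.findall("td")):
            ET.SubElement(record, name).text = (cell.text or "").strip()
    return result


# Serialize the extracted records as machine-readable XML.
xml_output = ET.tostring(extract(PAGE, PATTERN), encoding="unicode")
print(xml_output)
```

Keeping the pattern separate from the extraction code mirrors the design described in the abstract: supporting a new page layout means writing a new pattern, not new code. Non-static structures would need a different technique and are not covered by this sketch.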

Editor information

Yanchun Zhang, Ge Yu, Elisa Bertino, Guandong Xu

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lam, M.I., Gong, Z., Muyeba, M. (2008). A Method for Web Information Extraction. In: Zhang, Y., Yu, G., Bertino, E., Xu, G. (eds) Progress in WWW Research and Development. APWeb 2008. Lecture Notes in Computer Science, vol 4976. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78849-2_39

  • DOI: https://doi.org/10.1007/978-3-540-78849-2_39

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78848-5

  • Online ISBN: 978-3-540-78849-2

  • eBook Packages: Computer Science, Computer Science (R0)
