Recognition of Common Areas in a Web Page Using a Visualization Approach

  • Conference paper
Artificial Intelligence: Methodology, Systems, and Applications (AIMSA 2002)

Abstract

Extracting and processing information from web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a “bag of words” and then to perform additional processing on this flat representation. In this paper we propose a new, hierarchical representation that includes the browser screen coordinates of every HTML object in a page. Using this spatial information, one can define heuristics for recognizing common page areas such as the header, left and right menus, footer, and center of a page. Initial experiments show that, using our heuristics, the defined objects are recognized properly in 73% of cases.
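The idea of coordinate-based heuristics can be illustrated with a minimal sketch. It is not the paper's method: the `Box` type, the `classify_area` function, and the `band` and `side` thresholds are all hypothetical choices made for illustration, and the sketch ignores the hierarchical part of the proposed representation. It assumes each rendered HTML object has already been reduced to a pixel rectangle on the page.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Screen rectangle of one rendered HTML object, in pixels (hypothetical type)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def classify_area(box: Box, page_width: float, page_height: float,
                  band: float = 0.2, side: float = 0.3) -> str:
    """Assign a box to a common page area using only its coordinates.

    `band` and `side` are illustrative fractions of the page height and
    width; they are assumptions, not values taken from the paper.
    """
    top, bottom = box.y, box.y + box.height
    left, right = box.x, box.x + box.width

    # Vertical checks first: an object lying entirely in the top or bottom band.
    if bottom <= band * page_height:
        return "header"
    if top >= (1 - band) * page_height:
        return "footer"
    # Horizontal checks: an object lying entirely in the left or right strip.
    if right <= side * page_width:
        return "left menu"
    if left >= (1 - side) * page_width:
        return "right menu"
    return "center"


# Usage on a hypothetical 1000 x 2000 pixel page.
areas = [
    classify_area(Box(0, 0, 1000, 100), 1000, 2000),
    classify_area(Box(0, 500, 200, 800), 1000, 2000),
    classify_area(Box(300, 500, 400, 800), 1000, 2000),
]
```

Checking the vertical bands before the horizontal strips is a design choice of this sketch: a full-width banner at the top is treated as a header even though it also touches the left edge.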

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kovačević, M., Dilligenti, M., Gori, M., Milutinović, V. (2002). Recognition of Common Areas in a Web Page Using a Visualization Approach. In: Scott, D. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2002. Lecture Notes in Computer Science, vol 2443. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46148-5_21

  • DOI: https://doi.org/10.1007/3-540-46148-5_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44127-4

  • Online ISBN: 978-3-540-46148-7

  • eBook Packages: Springer Book Archive
