Web Document Analysis

Part of the book series: Advances in Pattern Recognition (ACVPR)

References

  1. Alam, H., Hartono, R., and Rahman, A.F.R. (2004). Extraction and management of content from HTML documents. In: A. Antonacopoulos and J. Hu (Eds.). Web Document Analysis: Challenges and Opportunities. Singapore: World Scientific, pp. 95-112.

  2. Antonacopoulos, A., Karatzas, D., and Ortiz Lopez, J. (2001). Accessing textual information embedded in internet images. Proceedings of SPIE Internet Imaging II, San Jose, USA, pp. 198-205.

  3. Antonacopoulos, A. and Karatzas, D. (2002). Fuzzy segmentation of characters in Web images based on human colour perception. In: D. Lopresti, J. Hu, and R. Kashi (Eds.). Document Analysis Systems V. London: Springer, LNCS 2423, pp. 295-306.

  4. Antonacopoulos, A. and Delporte, F. (1999). Automated interpretation of visual representations: extracting textual information from WWW images. In: R. Paton and I. Neilson (Eds.). Visual Representations and Interpretations. London: Springer.

  5. Baird, H.S. and Popat, K. (2004). Web security and document image analysis. In: A. Antonacopoulos and J. Hu (Eds.). Web Document Analysis: Challenges and Opportunities. Singapore: World Scientific.

  6. Blood, R. Weblogs: a history and perspective. http://www.rebeccablood.net/essays/weblog_history.html.

  7. Breuel, T.M., Janssen, W.C., Popat, K., and Baird, H.S. (2004). Reflowable document images. In: A. Antonacopoulos and J. Hu (Eds.). Web Document Analysis: Challenges and Opportunities. Singapore: World Scientific.

  8. Brown, M.K., Glinski, S.C., and Schmult, B.C. (2001). Web page analysis for voice browsing. Proceedings of the First International Workshop on Web Document Analysis (WDA2001), Seattle, USA.

  9. Chen, L.Q., Xie, X., Ma, W.Y., and Zhang, H.J. (2003). Dress: a slicing tree based web page representation for various display sizes. WWW2003 (poster), Budapest, Hungary.

  10. Chen, Y., Ma, W., and Zhang, H.J. (2003). Detecting web page structure for adaptive viewing on small form-factor devices. WWW2003, Budapest, Hungary.

  11. Cohen, W.W., Hurst, M., and Jensen, L.S. (2004). A wrapper induction system for complex documents and its application to tabular data on the web. In: A. Antonacopoulos and J. Hu (Eds.). Web Document Analysis: Challenges and Opportunities. Singapore: World Scientific, pp. 155-178.

  12. Di Iorio, A. and Vitali, F. (2003). A xanalogical collaborative editing environment. In: A. Antonacopoulos and J. Hu (Eds.). Second International Workshop on Web Document Analysis (WDA2003).

  13. Gupta, S., Kaiser, G., Neistadt, D., and Grimm, P. (2003). DOM-based content extraction of HTML documents. WWW2003, Budapest, Hungary.

  14. Hsu, C. and Dung, M. (1998). Generating finite-state transducers for semi-structured data extraction from the web. Journal of Information Systems, 23, pp. 521-538.

  15. http://www.webdav.org.

  16. Hu, J. and Bagga, A. (2004). Functional categorization of images in web documents. IEEE Multimedia Special Issue on Content Repurposing.

  17. International workshop on web document analysis. http://www.csc.liv.ac.uk/~wda2001 and http://www.csc.liv.ac.uk/~wda2003.

  18. Jain, A.K. and Yu, B. (1998). Automatic text location in images and video frames. Pattern Recognition, 31(12), pp. 2055-2076.

  19. Ashish, N. and Knoblock, C. (1997). Wrapper generation for semi-structured internet sources. Proceedings of PODS/SIGMOD'97.

  20. Yee, K.P. CritLink: Public Web Annotation. http://zesty.ca/crit.

  21. Kanungo, T., Lee, C.H., and Bradford, R. (2001). What fraction of images on the web contain text? Proceedings of the First International Workshop on Web Document Analysis (WDA2001), Seattle, USA, pp. 43-46.

  22. Karatzas, D. and Antonacopoulos, A. (2004). Text extraction from web images based on a split-and-merge segmentation method using colour perception. Proceedings of the Seventeenth International Conference on Pattern Recognition (ICPR2004), Cambridge, UK. Silver Spring, MD: IEEECS Press, pp. 634-637.

  23. Kasik, D.J. (2004). Strategies for consistent image partitioning. IEEE Multimedia Special Issue on Content Repurposing.

  24. Kushmerick, N., Weld, D. and Doorenbos, R. (1997). Wrapper induction for information extraction. Proceedings of the Fifteenth International Conference on Artificial Intelligence, pp. 729-735.

  25. Lai, W.C., Chang, E.Y., and Cheng, K.T. (2004). An anatomy of a large-scale image search engine. In: A. Antonacopoulos and J. Hu (Eds.). Web Document Analysis: Challenges and Opportunities. Singapore: World Scientific.

  26. Leuf, B. and Cunningham, W. (2001). The Wiki way. New York: Addison-Wesley.

  27. Lopresti, D. and Wilfong, G. (2004). Applications of graph probing to web document analysis. In: A. Antonacopoulos and J. Hu (Eds.). Web Document Analysis: Challenges and Opportunities. Singapore: World Scientific.

  28. Lopresti, D. and Zhou, J. (2000). Locating and recognizing text in WWW images. Information Retrieval, 2(2/3), pp. 177-206.

  29. Mukherjee, S., Yang, G., Tan, W., and Ramakrishnan, I.V. (2003). Automatic discovery of semantic structures in html documents. Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR2003), Edinburgh, Scotland.

  30. Muslea, I. (1999). Extracting patterns for information extraction tasks: a survey. AAAI-99 Workshop on Machine Learning for Information Extraction.

  31. Nanno, T., Saito, S., and Okumura, M. (2003). Structuring web pages based on repetition of elements. In: A. Antonacopoulos and J. Hu (Eds.). Second International Workshop on Web Document Analysis (WDA2003).

  32. Narayan, M., Williams, C., Perugini, S., and Ramakrishnan, N. (2004). Staging transformations for multimodal web interaction management. WWW2004. New York, USA, pp. 212-223.

  33. Penn, G., Hu, J., Luo, H., and McDonald, R. (2001). Flexible web document analysis for delivery to narrow-bandwidth devices. Proceedings of the Sixth International Conference on Document Analysis and Recognition (ICDAR01), Seattle, WA, USA, pp. 1074-1078.

  34. Perantonis, S.J., Gatos, B., and Maragos, V. (2003). A novel Web image processing algorithm for text area identification that helps commercial OCR engines to improve their Web image recognition efficiency. Proceedings of the Second International Workshop on Web Document Analysis (WDA2003), Edinburgh, Scotland, pp. 61-64.

  35. Ramachandran, S. and Kashi, R. (2003). An architecture for ink annotations on web documents. Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR2003), Edinburgh, Scotland.

  36. Ramakrishnan, I.V., Stent, A., and Yang, G. (2004). Hearsay: enabling audio browsing on hypertext content. WWW2004, New York, USA, pp. 80-89.

  37. Schenker, A., Last, M., Bunke, H., and Kandel, A. (2004). Clustering of web documents using a graph model. In: A. Antonacopoulos and J. Hu (Eds.). Web Document Analysis: Challenges and Opportunities. Singapore: World Scientific.

  38. Shih, L.K. and Karger, D.R. (2004). Using URLs and table layout for web classification tasks. WWW2004, New York, USA, pp. 193-202.

  39. Singh, G. (2004). Content repurposing. IEEE Multimedia Special Issue on Content Repurposing.

  40. Tao, C. and Munson, E.V. (2003). A relevance model for web image search. Proceedings of the Second International Workshop on Web Document Analysis (WDA2003), Edinburgh, Scotland, pp. 58-60.

  41. The ACM Symposium on Document Engineering. http://www.documentengineering.org.

  42. Thuong, T.T. and Roisin, C. (2004). Structured media for authoring multi-media documents. In: A. Antonacopoulos and J. Hu (Eds.). Web Document Analysis: Challenges and Opportunities. Singapore: World Scientific.

  43. van Ossenbruggen, J., Rutledge, L., and Hardman, L. (2003). Towards a multimedia formatting vocabulary. WWW2003, Budapest, Hungary.

  44. Villard, L., Roisin, C., and Layaida, N. (2000). An XML based multimedia document processing model for content adaptation. Digital Documents and Electronic Publishing Conference (DDEP00), pp. 1-12.

  45. Wang, Y. and Hu, J. (2002). A machine learning based approach for table detection on the web. WWW2002, Honolulu, Hawaii, USA.

  46. Yang, Y., Chen, Y., and Zhang, H.J. (2004). HTML page analysis based on visual cues. In: A. Antonacopoulos and J. Hu (Eds.). Web Document Analysis: Challenges and Opportunities. Singapore: World Scientific.

  47. Yoshida, M., Torisawa, K., and Tsujii, J. (2004). Extracting attributes and their values from web pages. In: A. Antonacopoulos and J. Hu (Eds.). Web Document Analysis: Challenges and Opportunities. Singapore: World Scientific.

  48. Zhou, J., Lopresti, D., and Tasdizen, T. (1998). Finding text in color images. Proceedings of the IS&T/SPIE Symposium on Electronic Imaging, San Jose, California, pp. 130-140.

Copyright information

© 2007 Springer-Verlag London Limited

About this chapter

Cite this chapter

Antonacopoulos, A., Hu, J. (2007). Web Document Analysis. In: Chaudhuri, B.B. (eds) Digital Document Processing. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84628-726-8_18

  • DOI: https://doi.org/10.1007/978-1-84628-726-8_18

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84628-501-1

  • Online ISBN: 978-1-84628-726-8

  • eBook Packages: Computer Science, Computer Science (R0)
