Web Document Analysis

Antonacopoulos, Apostolos; Hu, Jianying

doi:10.1007/978-1-84628-726-8_18

Part of the book series: Advances in Pattern Recognition (ACVPR)

Author information

Authors and Affiliations

School of Computing, Science and Engineering, University of Salford, Greater Manchester, UK
Apostolos Antonacopoulos
IBM T. J. Watson Research Center, 1101 Kitchawan Road, Route 134, 14228, Yorktown Heights, New York, USA
Jianying Hu

Editor information

Editors and Affiliations

Indian Statistical Institute, Kolkata, India
Bidyut B. Chaudhuri PhD

About this chapter

Cite this chapter

Antonacopoulos, A., Hu, J. (2007). Web Document Analysis. In: Chaudhuri, B.B. (eds) Digital Document Processing. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84628-726-8_18

DOI: https://doi.org/10.1007/978-1-84628-726-8_18
Publisher Name: Springer, London
Print ISBN: 978-1-84628-501-1
Online ISBN: 978-1-84628-726-8
eBook Packages: Computer Science, Computer Science (R0)
