Abstract
Many documents on the current web have a fairly complex structure that allows them to present various kinds of information. Apart from the main content, documents usually contain headers and footers, navigation sections and other types of additional information. For many applications, such as document indexing or browsing on special devices, it is desirable that the main document information precede the additional information in the underlying HTML code. In this paper, we propose a method of document preprocessing that automatically restructures the document code according to this criterion. Our method is based on analysis of the rendered document: a page segmentation algorithm detects the basic blocks on the page, and the relevance of the individual parts is estimated from the visual properties of the text content.
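The restructuring idea can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not state which visual properties are used or how they are combined, so the `Block` fields, the `relevance` heuristic (text length weighted by relative font size) and the `threshold` parameter are all hypothetical. The sketch assumes a segmentation step has already produced the blocks, and only shows how blocks judged relevant could be moved ahead of the rest while keeping the original order within each group.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One visual block from a (hypothetical) page segmentation step."""
    html: str          # HTML code of the block
    text_length: int   # number of text characters in the block
    font_size: float   # dominant rendered font size, in points


def relevance(block: Block, base_font_size: float = 12.0) -> float:
    # Assumed heuristic, not taken from the paper:
    # more text and a larger font suggest main content.
    return block.text_length * (block.font_size / base_font_size)


def restructure(blocks: List[Block], threshold: float = 0.5) -> List[Block]:
    """Place blocks scoring at least `threshold` times the best score first.

    The original order is preserved inside both groups.
    """
    if not blocks:
        return []
    best = max(relevance(b) for b in blocks)
    main = [b for b in blocks if relevance(b) >= threshold * best]
    rest = [b for b in blocks if relevance(b) < threshold * best]
    return main + rest


page = [
    Block("<div id='nav'>...</div>", text_length=40, font_size=10.0),
    Block("<div id='main'>...</div>", text_length=1200, font_size=12.0),
    Block("<div id='footer'>...</div>", text_length=60, font_size=9.0),
]
reordered_html = "\n".join(b.html for b in restructure(page))
```

Partitioning rather than fully sorting by score keeps the reading order of the main content intact, which matters when several blocks together form the article body.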
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Burget, R. (2010). Automatic Web Document Restructuring Based on Visual Information Analysis. In: Snášel, V., Szczepaniak, P.S., Abraham, A., Kacprzyk, J. (eds) Advances in Intelligent Web Mastering - 2. Advances in Intelligent and Soft Computing, vol 67. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10687-3_6
DOI: https://doi.org/10.1007/978-3-642-10687-3_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10686-6
Online ISBN: 978-3-642-10687-3
eBook Packages: Engineering, Engineering (R0)