An FAR-SW based approach for webpage information extraction

Bu, Zhan; Zhang, Chengcui; Xia, Zhengyou; Wang, Jiandong

doi:10.1007/s10796-013-9412-2

Published: 19 February 2013

Information Systems Frontiers, Volume 16, pages 771–785 (2014)

Zhan Bu¹,², Chengcui Zhang², Zhengyou Xia¹ & Jiandong Wang¹

Abstract

Automatically identifying and extracting the target information of a webpage, especially its main text, is a critical task in many web content analysis applications, such as information retrieval and automated screen reading. However, compared with typical plain text, information on the web is structurally complex and follows no single fixed template or layout. Moreover, the number of presentational elements on web pages, such as dynamic navigational menus, flashing logos, and a multitude of ad blocks, has increased rapidly in the past decade. In this paper, we propose a statistics-based approach that integrates fuzzy association rules (FAR) with a sliding window (SW) to efficiently extract the main text content from web pages. Our approach involves two separate stages. In Stage 1, the original HTML source is pre-processed and features are extracted for every line of text; supervised learning is then performed to detect fuzzy association rules in training web pages. In Stage 2, HTML source preprocessing and text line feature extraction are conducted in the same way as in Stage 1, after which the extracted fuzzy association rules are used to test whether each text line belongs to the main text. Next, a sliding window is applied to segment the web page into several potential topical blocks. Finally, a simple selection algorithm selects the important blocks, which are then united as the detected topical region (the main text). Experimental results on real-world data show that the efficiency and accuracy of our approach are better than those of existing Document Object Model (DOM)-based and vision-based approaches.
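
The abstract does not give the details of the sliding-window and block-selection steps, so the following Python sketch is only an illustration of the general idea in Stage 2, not the authors' algorithm. It assumes each text line has already been labeled as main text or not (the list `flags`, standing in for the output of the fuzzy association rules). The function names, the centered window, the density threshold, and the `keep_ratio` selection rule are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def window_density(flags: List[bool], width: int) -> List[float]:
    """Fraction of main-text lines inside a centered window around each line.

    Assumption: a centered window of `width` lines, clipped at page edges.
    """
    half = width // 2
    density = []
    for i in range(len(flags)):
        lo = max(0, i - half)
        hi = min(len(flags), i + half + 1)
        density.append(sum(flags[lo:hi]) / (hi - lo))
    return density


def segment_blocks(density: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Group consecutive lines with density >= threshold into (start, end) blocks.

    `end` is exclusive, so a block covers lines start .. end - 1.
    """
    blocks = []
    start = None
    for i, d in enumerate(density):
        if d >= threshold and start is None:
            start = i
        elif d < threshold and start is not None:
            blocks.append((start, i))
            start = None
    if start is not None:
        blocks.append((start, len(density)))
    return blocks


def select_blocks(
    blocks: List[Tuple[int, int]], flags: List[bool], keep_ratio: float
) -> List[Tuple[int, int]]:
    """Keep blocks whose main-text line count is at least keep_ratio of the best block.

    Assumption: a stand-in for the paper's "simple selection algorithm".
    """
    if not blocks:
        return []
    weights = [sum(flags[a:b]) for a, b in blocks]
    best = max(weights)
    return [blk for blk, w in zip(blocks, weights) if w >= keep_ratio * best]


# Hypothetical per-line labels for a 14-line page (True = predicted main text).
flags = [False, False, True, True, True, False, True, True,
         False, False, False, True, False, False]

density = window_density(flags, width=3)
blocks = segment_blocks(density, threshold=0.5)
main_region = select_blocks(blocks, flags, keep_ratio=0.5)
print(main_region)
```

In this toy input, the single misclassified line inside the run of main-text lines is absorbed into the surrounding block, while the isolated positive line near the end never reaches the density threshold. That smoothing effect is the usual motivation for a window-based step after per-line classification.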



Acknowledgments

This work was supported by the Jiangsu Innovation Program for Graduate Education (Project No. CXZZ12_0162) and the Fundamental Research Funds for the Central Universities.

Author information

Authors and Affiliations

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
Zhan Bu, Zhengyou Xia & Jiandong Wang
Computer and Information Sciences, The University of Alabama at Birmingham, Birmingham, AL, USA
Zhan Bu & Chengcui Zhang


Corresponding author

Correspondence to Zhan Bu.


About this article

Cite this article

Bu, Z., Zhang, C., Xia, Z. et al. An FAR-SW based approach for webpage information extraction. Inf Syst Front 16, 771–785 (2014). https://doi.org/10.1007/s10796-013-9412-2

Published: 19 February 2013
Issue Date: November 2014
DOI: https://doi.org/10.1007/s10796-013-9412-2
