
An FAR-SW based approach for webpage information extraction

Information Systems Frontiers

Abstract

Automatically identifying and extracting the target information of a webpage, especially its main text, is a critical task in many web content analysis applications, such as information retrieval and automated screen reading. Compared with typical plain text, however, information on the web is structured in extremely complex ways and follows no single fixed template or layout. In addition, the number of presentational elements on web pages, such as dynamic navigational menus, flashing logos, and a multitude of ad blocks, has increased rapidly over the past decade. In this paper, we propose a statistics-based approach that integrates fuzzy association rules (FAR) with a sliding window (SW) to efficiently extract the main text content from web pages. Our approach involves two separate stages. In Stage 1, the original HTML source is pre-processed and features are extracted for every line of text; supervised learning is then performed to detect fuzzy association rules in the training web pages. In Stage 2, the HTML source is pre-processed and text line features are extracted in the same way as in Stage 1, after which each text line is tested against the extracted fuzzy association rules to decide whether it belongs to the main text. Next, a sliding window is applied to segment the web page into several potential topical blocks. Finally, a simple selection algorithm picks the important blocks, which are then merged to form the detected topical region (the main text). Experimental results on real-world data show that our approach is more efficient and more accurate than existing Document Object Model (DOM)-based and vision-based approaches.
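
The Stage 2 pipeline described in the abstract (per-line features, rule-based testing, sliding-window segmentation, block selection) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration, not the authors' method: the two features (visible text length and link-text ratio), the linear membership function `ramp`, the single hand-written rule in `main_text_degree`, the window size, the thresholds, and the selection criterion. In particular, the rule stands in for the fuzzy association rules that the paper learns from training pages in Stage 1, which this sketch does not attempt.

```python
import re
from typing import List, Tuple

TAG_RE = re.compile(r"<[^>]+>")
LINK_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def line_features(line: str) -> Tuple[int, float]:
    """Return (visible text length, share of visible text inside links)."""
    text = TAG_RE.sub("", line).strip()
    link_text = "".join(TAG_RE.sub("", m) for m in LINK_RE.findall(line)).strip()
    link_ratio = len(link_text) / len(text) if text else 0.0
    return len(text), link_ratio


def ramp(x: float, lo: float, hi: float) -> float:
    """Fuzzy membership rising linearly from 0 at lo to 1 at hi (assumed shape)."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def main_text_degree(line: str) -> float:
    """Hand-written stand-in rule: 'long text AND few links => main text'."""
    length, link_ratio = line_features(line)
    long_text = ramp(length, 20, 80)            # assumed breakpoints
    few_links = 1.0 - ramp(link_ratio, 0.2, 0.8)
    return min(long_text, few_links)            # fuzzy AND as minimum


def smooth(scores: List[float], window: int = 3) -> List[float]:
    """Mean score in a window centred on each line (truncated at the edges)."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        chunk = scores[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def segment(scores: List[float], window: int = 3,
            threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive lines whose windowed score passes the threshold
    into candidate blocks, returned as inclusive (start, end) line indices."""
    blocks, start = [], None
    for i, s in enumerate(smooth(scores, window)):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            blocks.append((start, i - 1))
            start = None
    if start is not None:
        blocks.append((start, len(scores) - 1))
    return blocks


def select_blocks(blocks: List[Tuple[int, int]], scores: List[float],
                  keep_ratio: float = 0.5) -> List[Tuple[int, int]]:
    """Keep blocks whose total score is at least keep_ratio of the best block's."""
    if not blocks:
        return []
    weights = [sum(scores[a:b + 1]) for a, b in blocks]
    best = max(weights)
    return [blk for blk, w in zip(blocks, weights) if w >= keep_ratio * best]


def extract_main_text(html_lines: List[str]) -> List[str]:
    """Score every line, segment, select, and return the visible text kept."""
    scores = [main_text_degree(line) for line in html_lines]
    chosen = select_blocks(segment(scores), scores)
    return [TAG_RE.sub("", html_lines[i]).strip()
            for a, b in chosen for i in range(a, b + 1)]
```

Calling `extract_main_text` on a list of HTML source lines returns the visible text of the lines in the selected blocks. Averaging over a window rather than thresholding each line on its own is what lets a short line inside an article body survive while an isolated long line inside a menu is discarded.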



Acknowledgments

This work was supported by the Jiangsu Innovation Program for Graduate Education (Project No. CXZZ12_0162) and the Fundamental Research Funds for the Central Universities.

Author information

Corresponding author

Correspondence to Zhan Bu.

About this article

Cite this article

Bu, Z., Zhang, C., Xia, Z. et al. An FAR-SW based approach for webpage information extraction. Inf Syst Front 16, 771–785 (2014). https://doi.org/10.1007/s10796-013-9412-2

  • DOI: https://doi.org/10.1007/s10796-013-9412-2
