Extracting multiple news attributes based on visual features

Journal of Intelligent Information Systems

Abstract

The problem of automatically extracting multiple news attributes from news pages is studied in this paper. Most previous work on web news article extraction focuses only on content. To meet a growing demand for web data integration applications, more useful news attributes, such as title, publication date, author, etc., need to be extracted from news pages and stored in a structured way for further processing. An automatic unified approach to extract such attributes based on their visual features, including independent and dependent visual features, is proposed. Unlike conventional methods, such as extracting attributes separately or generating template-dependent wrappers, the basic idea of this approach is twofold. First, candidates for each news attribute are extracted from the page based on their independent visual features. Second, the true value of each attribute is identified from the candidates based on dependent visual features such as the layout relationships among the attributes. Extensive experiments with a large number of news pages show that the proposed approach is highly effective and efficient.
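The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `TextBlock` type, the specific independent features (font size for the title, text length for the content), and the dependent feature (the title sits directly above the content and overlaps it horizontally) are all assumptions chosen for illustration. Coordinates follow the convention that the origin is the top-left corner of the page.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TextBlock:
    """Hypothetical rendered text block; (x, y) is its top-left corner."""
    text: str
    x: int
    y: int
    width: int
    height: int
    font_size: int


def title_candidates(blocks, top_k=2):
    # Step 1, independent feature (assumed): titles use the largest fonts.
    return sorted(blocks, key=lambda b: b.font_size, reverse=True)[:top_k]


def content_candidates(blocks, top_k=2):
    # Step 1, independent feature (assumed): the body holds the most text.
    return sorted(blocks, key=lambda b: len(b.text), reverse=True)[:top_k]


def layout_score(title, content):
    # Step 2, dependent feature (assumed): the title lies above the content,
    # overlaps it horizontally, and a smaller vertical gap scores higher.
    above = title.y + title.height <= content.y
    overlap = (min(title.x + title.width, content.x + content.width)
               - max(title.x, content.x))
    if not above or overlap <= 0:
        return float("-inf")
    return -(content.y - (title.y + title.height))


def extract(blocks):
    # Pick the (title, content) pair whose layout relationship scores best.
    pairs = product(title_candidates(blocks), content_candidates(blocks))
    return max(pairs, key=lambda pair: layout_score(*pair))


# Made-up page: the site logo has the largest font, so font size alone
# would pick the wrong title; the layout relationship resolves it.
logo = TextBlock("Daily News", 20, 10, 200, 60, 40)
headline = TextBlock("Council approves new budget", 100, 120, 600, 40, 32)
body = TextBlock("Body text. " * 50, 100, 200, 600, 800, 14)
sidebar = TextBlock("Related stories: " + "link " * 20, 750, 120, 200, 600, 12)

title, content = extract([logo, headline, body, sidebar])
print(title.text)
```

In this made-up page the logo is a title candidate because of its font size, but the headline wins because it sits closer above the body block. A real system would combine many more features for each attribute, including publication date and author.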

Notes

  1. The origin is the top-left corner of a web page, and (x, y) is the top-left corner of a text block.

Acknowledgements

This research is supported by Advanced Research Fund of Institute of Scientific and Technical Information of China under grant YY-201005. The authors are also grateful to the anonymous reviewers for their constructive comments, which have helped improve the quality of the paper.

Author information

Corresponding author

Correspondence to Wei Liu.

About this article

Cite this article

Liu, W., Yan, H. & Xiao, J. Extracting multiple news attributes based on visual features. J Intell Inf Syst 38, 465–486 (2012). https://doi.org/10.1007/s10844-011-0163-6

  • DOI: https://doi.org/10.1007/s10844-011-0163-6
