Extracting multiple news attributes based on visual features

Liu, Wei; Yan, Hualiang; Xiao, Jianguo

doi:10.1007/s10844-011-0163-6

Published: 07 June 2011

Volume 38, pages 465–486, (2012)

Journal of Intelligent Information Systems

Wei Liu¹,
Hualiang Yan² &
Jianguo Xiao²

Abstract

The problem of automatically extracting multiple news attributes from news pages is studied in this paper. Most previous work on web news article extraction focuses only on content. To meet a growing demand for web data integration applications, more useful news attributes, such as title, publication date, author, etc., need to be extracted from news pages and stored in a structured way for further processing. An automatic unified approach to extract such attributes based on their visual features, including independent and dependent visual features, is proposed. Unlike conventional methods, such as extracting attributes separately or generating template-dependent wrappers, the basic idea of this approach is twofold. First, candidates for each news attribute are extracted from the page based on their independent visual features. Second, the true value of each attribute is identified from the candidates based on dependent visual features such as the layout relationships among the attributes. Extensive experiments with a large number of news pages show that the proposed approach is highly effective and efficient.
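
The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the `TextBlock` type, the font-size and word-count thresholds, the date pattern, and the "date sits closest below the title" rule are all hypothetical stand-ins, chosen only to show how candidates found from independent features might then be filtered by a layout relationship.

```python
import re
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TextBlock:
    """A rendered text block (hypothetical type, not from the paper)."""
    text: str
    x: float          # left edge; origin is the page's top-left corner
    y: float          # top edge; grows downward
    font_size: float


# Assumed date format for illustration only (ISO-like YYYY-MM-DD).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def title_candidates(blocks, min_font=18.0, max_words=20):
    # Step 1 (independent features): large font and short text.
    return [b for b in blocks
            if b.font_size >= min_font and len(b.text.split()) <= max_words]


def date_candidates(blocks):
    # Step 1 (independent features): text that contains a date-like string.
    return [b for b in blocks if DATE_RE.search(b.text)]


def pick_title_and_date(blocks):
    # Step 2 (dependent feature): keep pairs where the date lies below the
    # title, then prefer the pair with the smallest vertical gap.
    pairs = [(t, d)
             for t, d in product(title_candidates(blocks), date_candidates(blocks))
             if d.y > t.y]
    if not pairs:
        return None
    return min(pairs, key=lambda p: p[1].y - p[0].y)


blocks = [
    TextBlock("Site Name Daily", 10, 5, 24.0),
    TextBlock("Council approves new bridge", 10, 120, 28.0),
    TextBlock("By A. Writer | 2011-06-07", 10, 160, 11.0),
    TextBlock("Copyright 2011-01-01 Example Media", 10, 2000, 9.0),
]
title, date = pick_title_and_date(blocks)
```

In this toy page the site banner and the footer copyright line both pass step 1, and only the layout check in step 2 separates them from the real title and date.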

Notes

The origin of the coordinate system is the top-left corner of a web page, and (x, y) denotes the position of the top-left corner of a text block.

Acknowledgements

This research is supported by Advanced Research Fund of Institute of Scientific and Technical Information of China under grant YY-201005. The authors are also grateful to the anonymous reviewers for their constructive comments, which have helped improve the quality of the paper.

Author information

Authors and Affiliations

Institute of Scientific and Technical Information of China, 15 Fuxing Road, Beijing, 100038, China
Wei Liu
Institute of Computer Science & Technology, Peking University, 298 Chengfu Road, Beijing, 100871, China
Hualiang Yan & Jianguo Xiao

Corresponding author

Correspondence to Wei Liu.

About this article

Cite this article

Liu, W., Yan, H. & Xiao, J. Extracting multiple news attributes based on visual features. J Intell Inf Syst 38, 465–486 (2012). https://doi.org/10.1007/s10844-011-0163-6

Received: 11 January 2011
Revised: 10 May 2011
Accepted: 10 May 2011
Published: 07 June 2011
Issue Date: April 2012
DOI: https://doi.org/10.1007/s10844-011-0163-6
