A Unified Approach for Extracting Multiple News Attributes from News Pages

Liu, Wei; Yan, Hualiang; Yang, Jianwu; Xiao, Jianguo

doi:10.1007/978-3-642-15246-7_17

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6230)

Abstract

Most previous work on web news article extraction focuses only on the article's content and title. To meet the growing demand from web data integration applications, more news attributes, such as publication date and author, need to be extracted and stored in structured form for further processing. In this paper, we study the problem of automatically extracting multiple news attributes from news pages. Unlike traditional approaches (e.g., extracting news attributes separately or generating template-dependent wrappers), we propose an automatic, unified approach that extracts them based on the visual features of news attributes, which include independent visual features and dependent visual features. The basic idea is as follows: first, the candidates for each news attribute are extracted from the news page based on their independent visual features; then, the true value of each attribute is identified from the candidates based on dependent visual features (the layout relations among news attributes). Extensive experiments on a large number of news pages show that the proposed approach is highly effective and efficient.
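
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the actual visual features, so the `Block` fields, the font-size and length thresholds, the date pattern, and the layout rule (title above date, date above content) are all assumptions chosen only to make the example concrete.

```python
import re
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Block:
    """A rendered text block. The fields are hypothetical visual features."""
    text: str
    top: float        # y-coordinate of the block's top edge, in pixels
    font_size: float  # rendered font size, in pixels


# Assumed date format for this sketch only (ISO-style YYYY-MM-DD).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def candidates(blocks):
    """Stage 1: score blocks per attribute using only each block's own features."""
    cands = {"title": [], "date": [], "content": []}
    for b in blocks:
        if b.font_size >= 18 and len(b.text) < 120:    # large, short text
            cands["title"].append((b, b.font_size))
        if DATE_RE.search(b.text) and len(b.text) < 60:  # short text with a date
            cands["date"].append((b, 1.0))
        if len(b.text) >= 200:                          # long running text
            cands["content"].append((b, len(b.text) / 100))
    return cands


def layout_ok(title, date, content):
    """Stage 2 constraint (assumed): title above date, date above content."""
    return title.top < date.top < content.top


def extract(blocks):
    """Pick the best-scoring candidate combination that satisfies the layout rule."""
    cands = candidates(blocks)
    best, best_score = None, float("-inf")
    for (t, ts), (d, ds), (c, cs) in product(
        cands["title"], cands["date"], cands["content"]
    ):
        if not layout_ok(t, d, c):
            continue
        score = ts + ds + cs
        if score > best_score:
            best_score = score
            best = {
                "title": t.text,
                "date": DATE_RE.search(d.text).group(0),
                "content": c.text,
            }
    return best


# A made-up page: the last two blocks are decoys that look like a title and a date.
page = [
    Block("City Opens New Library", top=100, font_size=24),
    Block("2010-08-30 | Staff Reporter", top=140, font_size=11),
    Block("The library opened to the public on Monday. " * 6, top=180, font_size=13),
    Block("More Headlines", top=700, font_size=30),
    Block("Copyright 2010-01-01 Example News", top=900, font_size=10),
]

result = extract(page)
print(result["title"], "|", result["date"])
```

In this toy page, "More Headlines" has the highest title score when judged in isolation, and the footer contains a valid date. Both are rejected only because they violate the assumed layout relation, which is the role the abstract assigns to dependent visual features.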

Author information

Authors and Affiliations

Institute of Computer Science & Technology, Peking University
Wei Liu, Hualiang Yan, Jianwu Yang & Jianguo Xiao
Key Laboratory of Computational Linguistics (Peking University), MOE, China, 100871
Wei Liu & Jianwu Yang

Editor information

Editors and Affiliations

School of Computer Science and Engineering, Seoul National University, 151-744, Seoul, Korea
Byoung-Tak Zhang
Department of Computing, Macquarie University, NSW, Sydney, Australia
Mehmet A. Orgun

Cite this paper

Liu, W., Yan, H., Yang, J., Xiao, J. (2010). A Unified Approach for Extracting Multiple News Attributes from News Pages. In: Zhang, BT., Orgun, M.A. (eds) PRICAI 2010: Trends in Artificial Intelligence. PRICAI 2010. Lecture Notes in Computer Science, vol 6230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15246-7_17

DOI: https://doi.org/10.1007/978-3-642-15246-7_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15245-0
Online ISBN: 978-3-642-15246-7
eBook Packages: Computer Science, Computer Science (R0)
