Learning to Extract Web News Title in Template Independent Way

Wang, Can; Wang, Junfeng; Chen, Chun; Lin, Li; Guan, Ziyu; Zhu, Junyan; Zhang, Cheng; Bu, Jiajun

doi:10.1007/978-3-642-02962-2_24

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5589)

Abstract

Many news sites have large collections of news pages that are generated dynamically and continuously from underlying databases. Automatic extraction of news titles and contents from these pages is therefore an important technique for applications such as news aggregation systems. However, previous work has found that accurately extracting news titles from news pages of various styles is a challenging task. In this paper, we propose a machine learning approach to tackle this problem. Our approach is independent of templates and thus does not suffer from template updates, which usually invalidate the corresponding extractors. Empirical evaluation of our approach over 5,200 news Web pages collected from 13 important online news sites shows that it significantly improves the accuracy of news title extraction.

Author information

Authors and Affiliations

College of Computer Science, Zhejiang University, China
Can Wang, Junfeng Wang, Chun Chen, Li Lin, Ziyu Guan, Junyan Zhu & Jiajun Bu
China Disabled Persons’ Federation Information Center, China
Cheng Zhang

Editor information

Editors and Affiliations

Faculty of Engineering and Surveying, University of Southern Queensland, QLD 4350, Australia
Peng Wen
School of Information Technology, Queensland University of Technology, QLD 4001, Brisbane, Australia
Yuefeng Li
Institute of Mathematics, Warsaw University of Technology, Koszykowa 86, 02008, Warsaw, Poland
Lech Polkowski
Department of Computer Science, University of Regina, S4S 0A2, Regina, Saskatchewan, Canada
Yiyu Yao
Faculty of Medicine, Department of Medical Informatics, Shimane University, 89-1 Enya-cho, Izumo, 693-8501, Shimane, Japan
Shusaku Tsumoto
Institute of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China
Guoyin Wang

Cite this paper

Wang, C. et al. (2009). Learning to Extract Web News Title in Template Independent Way. In: Wen, P., Li, Y., Polkowski, L., Yao, Y., Tsumoto, S., Wang, G. (eds) Rough Sets and Knowledge Technology. RSKT 2009. Lecture Notes in Computer Science, vol 5589. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02962-2_24

DOI: https://doi.org/10.1007/978-3-642-02962-2_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02961-5
Online ISBN: 978-3-642-02962-2
eBook Packages: Computer Science, Computer Science (R0)
