Abstract
Many news sites host large collections of news pages that are generated dynamically and continuously from underlying databases. Automatic extraction of news titles and contents from these pages is therefore an important technique for applications such as news aggregation systems. However, previous work has found that accurately extracting news titles from pages of various styles is challenging. In this paper, we propose a machine learning approach to tackle this problem. Our approach is independent of templates and thus does not suffer from template updates, which usually invalidate the corresponding extractors. Empirical evaluation over 5,200 news Web pages collected from 13 important online news sites shows that our approach significantly improves the accuracy of news title extraction.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, C. et al. (2009). Learning to Extract Web News Title in Template Independent Way. In: Wen, P., Li, Y., Polkowski, L., Yao, Y., Tsumoto, S., Wang, G. (eds) Rough Sets and Knowledge Technology. RSKT 2009. Lecture Notes in Computer Science, vol 5589. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02962-2_24
DOI: https://doi.org/10.1007/978-3-642-02962-2_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02961-5
Online ISBN: 978-3-642-02962-2
eBook Packages: Computer Science, Computer Science (R0)