Learning to Extract Web News Title in Template Independent Way

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5589)

Abstract

Many news sites host large collections of news pages that are generated dynamically and continuously from underlying databases. Automatic extraction of news titles and contents from these pages is therefore an important technique for applications such as news aggregation systems. However, previous work has found that accurately extracting news titles from pages of various styles is a challenging task. In this paper, we propose a machine learning approach to tackle this problem. Our approach is independent of templates and is thus unaffected by template updates, which usually invalidate the corresponding extractors. Empirical evaluation over 5,200 news Web pages collected from 13 important online news sites shows that our approach significantly improves the accuracy of news title extraction.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, C. et al. (2009). Learning to Extract Web News Title in Template Independent Way. In: Wen, P., Li, Y., Polkowski, L., Yao, Y., Tsumoto, S., Wang, G. (eds) Rough Sets and Knowledge Technology. RSKT 2009. Lecture Notes in Computer Science, vol 5589. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02962-2_24

  • DOI: https://doi.org/10.1007/978-3-642-02962-2_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02961-5

  • Online ISBN: 978-3-642-02962-2

  • eBook Packages: Computer Science, Computer Science (R0)
