Hybrid Method for Automated News Content Extraction from the Web

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4255)

Abstract

Web news content extraction is vital for improving news indexing and search in today's search engines, especially for news search services. In this paper we study the Web news content extraction problem and propose an automated extraction algorithm for it. Our method is a hybrid that takes advantage of both sequence matching and tree matching techniques. We propose TSReC, a variant of tag sequence representation suitable for both sequence matching and tree matching, along with an associated algorithm for automated Web news content extraction. We implemented a prototype system for Web news content extraction and evaluated it empirically; the results show that our method is highly effective and efficient.
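
The abstract does not define TSReC, so the following is only a generic sketch of the idea it describes: a flat tag sequence, which supports sequence matching, annotated with nesting depth so that tree structure is still recoverable. It is not the authors' representation or algorithm. The names `TagSequenceBuilder`, `tag_sequence`, and `varying_text` are hypothetical, and the assumption that news content is the text that differs between two pages built from the same template is ours.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagSequenceBuilder(HTMLParser):
    """Flatten HTML into (kind, value, depth) tokens (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("open", tag, self.depth))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth = max(self.depth - 1, 0)
        self.tokens.append(("close", tag, self.depth))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("text", text, self.depth))


def tag_sequence(html):
    """Return the depth-annotated token sequence for an HTML string."""
    builder = TagSequenceBuilder()
    builder.feed(html)
    builder.close()
    return builder.tokens


def varying_text(page_a, page_b):
    """Return text tokens of page_a that do not align with page_b.

    Assumption: both pages share a template, so aligned tokens are
    template material and unaligned text is candidate content.
    """
    seq_a, seq_b = tag_sequence(page_a), tag_sequence(page_b)
    matcher = SequenceMatcher(a=seq_a, b=seq_b, autojunk=False)
    result = []
    for op, a1, a2, _b1, _b2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            result.extend(v for kind, v, _ in seq_a[a1:a2] if kind == "text")
    return result


TEMPLATE = (
    "<html><body><div>Site menu</div>"
    "<div><h1>{title}</h1><p>{body}</p></div>"
    "<div>Footer</div></body></html>"
)
page_a = TEMPLATE.format(title="Title A", body="Body A")
page_b = TEMPLATE.format(title="Title B", body="Body B")
print(varying_text(page_a, page_b))
```

On these two toy pages the shared menu and footer align and are discarded, leaving the title and body of the first page. A real system would need a more robust alignment and a way to use the depth information for tree matching, which this sketch only records.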

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, Y., Meng, X., Li, Q., Wang, L. (2006). Hybrid Method for Automated News Content Extraction from the Web. In: Aberer, K., Peng, Z., Rundensteiner, E.A., Zhang, Y., Li, X. (eds) Web Information Systems – WISE 2006. WISE 2006. Lecture Notes in Computer Science, vol 4255. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11912873_34

  • DOI: https://doi.org/10.1007/11912873_34

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-48105-8

  • Online ISBN: 978-3-540-48107-2

  • eBook Packages: Computer Science, Computer Science (R0)
