Hybrid Method for Automated News Content Extraction from the Web

Li, Yu; Meng, Xiaofeng; Li, Qing; Wang, Liping

doi:10.1007/11912873_34

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4255)

Abstract

Web news content extraction is vital for improving news indexing and search in today's search engines, especially for news search services. In this paper we study the Web news content extraction problem and propose an automated extraction algorithm for it. Our method is a hybrid that takes advantage of both sequence matching and tree matching techniques. We propose TSReC, a variant of tag sequence representation suitable for both sequence matching and tree matching, along with an associated algorithm for automated Web news content extraction. We implemented a prototype system for Web news content extraction and evaluated it empirically; the results show that our method is highly effective and efficient.
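The abstract does not define TSReC, so the following is only a minimal illustrative sketch of the general idea it names, not the authors' representation or algorithm. It assumes that two news pages from the same site share a template, flattens each page into a tag sequence whose tokens also record nesting depth (one hypothetical way a sequence can retain tree information), and uses plain sequence matching to find the text that varies between the pages. The names `TagSequenceBuilder`, `tag_sequence`, `varying_text`, and the two sample pages are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequenceBuilder(HTMLParser):
    """Flatten HTML into (kind, value, depth) tokens.

    Hypothetical token layout for illustration; not the TSReC definition.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("open", tag, self.depth))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth = max(self.depth - 1, 0)
        self.tokens.append(("close", tag, self.depth))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("text", text, self.depth))


def tag_sequence(html):
    """Return the token sequence for one HTML document."""
    builder = TagSequenceBuilder()
    builder.feed(html)
    builder.close()
    return builder.tokens


def varying_text(page_a, page_b):
    """Return the text tokens of page_a that do not align with page_b.

    Tokens that match across both pages are treated as shared template;
    text inside unmatched spans is treated as candidate news content.
    """
    seq_a, seq_b = tag_sequence(page_a), tag_sequence(page_b)
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    varying = []
    for op, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        if op in ("replace", "delete"):
            varying.extend(
                value for kind, value, _depth in seq_a[a_start:a_end]
                if kind == "text"
            )
    return varying


# Two made-up pages that share a template but carry different articles.
PAGE_A = (
    "<html><body><div>Site News</div>"
    "<div><h1>Storm hits coast</h1><p>Heavy rain fell overnight.</p></div>"
    "<div>Copyright</div></body></html>"
)
PAGE_B = (
    "<html><body><div>Site News</div>"
    "<div><h1>Markets rally</h1><p>Stocks rose on Monday.</p></div>"
    "<div>Copyright</div></body></html>"
)

print(varying_text(PAGE_A, PAGE_B))
# → ['Storm hits coast', 'Heavy rain fell overnight.']
```

In this sketch the banner and footer text align across both pages and are discarded, while the headline and body text fall into unmatched spans. A real extractor would need to handle malformed HTML, void elements, and content that happens to repeat across articles.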

Author information

Authors and Affiliations

School of Information, Renmin Univ. of China, China
Yu Li & Xiaofeng Meng
Computer Science Dept., City Univ. of Hong Kong, HKSAR, China
Qing Li & Liping Wang

Editor information

Editors and Affiliations

School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
Karl Aberer
State Key Lab of Software Engineering, Wuhan University, 430072, Wuhan, China
Zhiyong Peng
Department of Computer Science, Worcester Polytechnic Institute, Worcester, MA
Elke A. Rundensteiner
Victoria University, Australia
Yanchun Zhang
Wuhan University, China
Xuhui Li

About this paper

Cite this paper

Li, Y., Meng, X., Li, Q., Wang, L. (2006). Hybrid Method for Automated News Content Extraction from the Web. In: Aberer, K., Peng, Z., Rundensteiner, E.A., Zhang, Y., Li, X. (eds) Web Information Systems – WISE 2006. WISE 2006. Lecture Notes in Computer Science, vol 4255. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11912873_34

DOI: https://doi.org/10.1007/11912873_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-48105-8
Online ISBN: 978-3-540-48107-2
eBook Packages: Computer Science, Computer Science (R0)
