A Layout-Independent Web News Article Contents Extraction Method Based on Relevance Analysis

Han, Hao; Tokuda, Takehiro

doi:10.1007/978-3-642-02818-2_37

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5648)

Included in the following conference series:

International Conference on Web Engineering

Abstract

Traditional methods for extracting Web news article contents are time-consuming and require considerable maintenance because they analyze the layout of news pages to generate wrappers, either manually or automatically. In this paper, we propose a relevance-based analysis method that extracts article contents from news pages without analyzing page layouts beforehand. The method is applicable to general news pages, and we present implementations of news extraction from different kinds of news sources.

Author information

Authors and Affiliations

Department of Computer Science, Tokyo Institute of Technology, Meguro, Tokyo, 152-8552, Japan
Hao Han & Takehiro Tokuda

Editor information

Editors and Affiliations

Faculty of Computer Science, Chemnitz University of Technology, Strasse der Nationen 62, 09111, Chemnitz, Germany
Martin Gaedke
Dipartimento di Elettronica e Informazione, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133, Milano, Italy
Michael Grossniklaus
Department of Computer Languages and Systems, University of the Basque Country, Pº M. Lardizabal 1, 20018, San Sebastián, Spain
Oscar Díaz

About this paper

Cite this paper

Han, H., Tokuda, T. (2009). A Layout-Independent Web News Article Contents Extraction Method Based on Relevance Analysis. In: Gaedke, M., Grossniklaus, M., Díaz, O. (eds) Web Engineering. ICWE 2009. Lecture Notes in Computer Science, vol 5648. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02818-2_37

DOI: https://doi.org/10.1007/978-3-642-02818-2_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02817-5
Online ISBN: 978-3-642-02818-2
eBook Packages: Computer Science, Computer Science (R0)
