Abstract
Traditional methods for extracting article content from Web news pages are time-consuming and maintenance-heavy because they analyze the layout of news pages to generate wrappers, either manually or automatically. In this paper, we propose a relevance-based analysis method that extracts article content from news pages without analyzing page layouts beforehand. The method is applicable to general news pages, and we present implementations of news extraction for several kinds of news sources.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Han, H., Tokuda, T. (2009). A Layout-Independent Web News Article Contents Extraction Method Based on Relevance Analysis. In: Gaedke, M., Grossniklaus, M., Díaz, O. (eds) Web Engineering. ICWE 2009. Lecture Notes in Computer Science, vol 5648. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02818-2_37
DOI: https://doi.org/10.1007/978-3-642-02818-2_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02817-5
Online ISBN: 978-3-642-02818-2
eBook Packages: Computer Science, Computer Science (R0)