Abstract
Content extraction from Web news pages is a basic task for many Web applications and needs to be solved well. This paper presents a new method for extracting the content of Web news pages. The method first parses the HTML code in a simple and convenient way that does not rely on a third-party toolkit, turning the HTML structure into a more easily manipulated DOM (Document Object Model) tree. On this basis, it selects the candidate sub-trees that may contain the main content of the page. For these candidates, which are Element nodes of the DOM tree, four specific attributes defined in this paper are computed, and a decision tree is trained on these attributes. Identifying the news body sub-tree among the sub-trees of a page can thus be regarded as a classification problem, in which learning and prediction rely on a well-trained decision tree.
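The first steps described in the abstract (parsing HTML into a DOM tree without a third-party toolkit, selecting candidate sub-trees, and computing per-node attributes) can be sketched as follows. This is a minimal illustration using only Python's standard `html.parser`, not the authors' implementation. The abstract does not name the four attributes, so the four features below (`text_length`, `link_ratio`, `paragraphs`, `depth`), the `CANDIDATE_TAGS` set, and all class and function names are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Assumption: container tags treated as candidate sub-tree roots.
CANDIDATE_TAGS = {"div", "td", "article", "section"}


class Node:
    """One Element node of a simplified DOM tree."""

    def __init__(self, tag, parent=None, attrs=None):
        self.tag = tag
        self.parent = parent
        self.attrs = dict(attrs or [])
        self.children = []  # child Element nodes
        self.text = []      # text chunks directly inside this element

    def walk(self):
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


class DomBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current, attrs)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        # Skip whitespace-only text and the contents of script/style.
        if data.strip() and self.current.tag not in ("script", "style"):
            self.current.text.append(data.strip())


def parse(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def candidates(root):
    """Select the sub-trees that might hold the main content."""
    return [n for n in root.walk() if n.tag in CANDIDATE_TAGS]


def text_length(node):
    return sum(len(t) for n in node.walk() for t in n.text)


def features(node):
    """Four hypothetical attributes of a candidate sub-tree."""
    total = text_length(node)
    links = sum(text_length(n) for n in node.walk() if n.tag == "a")
    depth, parent = 0, node.parent
    while parent is not None:
        depth, parent = depth + 1, parent.parent
    return {
        "text_length": total,
        "link_ratio": links / total if total else 0.0,
        "paragraphs": sum(1 for n in node.walk() if n.tag == "p"),
        "depth": depth,
    }


if __name__ == "__main__":
    page = ('<html><body><div id="nav"><a href="/">Home</a></div>'
            '<div id="story"><p>First paragraph.</p><p>Second.</p></div>'
            '</body></html>')
    for cand in candidates(parse(page)):
        print(cand.attrs.get("id"), features(cand))
```

Each candidate's feature vector, paired with a label saying whether it is the news body, would then serve as one training example for a decision-tree learner. At prediction time the trained tree classifies the candidates of an unseen page.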
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Chen, Z., Lv, J.C. (2012). Web News Pages Extraction Method Based on DOM and Decision Tree. In: Qian, Z., Cao, L., Su, W., Wang, T., Yang, H. (eds) Recent Advances in Computer Science and Information Engineering. Lecture Notes in Electrical Engineering, vol 124. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25781-0_23
DOI: https://doi.org/10.1007/978-3-642-25781-0_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25780-3
Online ISBN: 978-3-642-25781-0
eBook Packages: Engineering, Engineering (R0)