Web News Pages Extraction Method Based on DOM and Decision Tree

  • Chapter

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 124)

Abstract

Content extraction from Web news pages is a basic task underlying many web applications, and it needs to be solved well. This paper presents a new method for extracting the content of Web news pages. The method first parses the HTML code in a simple and convenient way that does not rely on a third-party toolkit, turning the HTML structure into a more easily manipulated DOM (Document Object Model) tree. On this basis, it selects the candidate sub-trees that may contain the main content of the page. For these candidates, which are Element nodes of the DOM tree, four specific attributes defined in this paper are obtained, and a decision tree is trained on these attributes. Identifying the news body sub-tree among the many sub-trees of a page can thus be regarded as a classification problem whose solution relies on a well-trained decision tree.
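The first stages of the pipeline described in the abstract (toolkit-free parsing into a DOM tree, candidate sub-tree selection, and per-candidate attributes) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the four attributes, so the ones computed here (text length, link-text ratio, paragraph count, depth) are assumptions chosen for illustration, as are the `Node`, `DomBuilder`, `parse`, `candidates`, and `features` names and the set of candidate tags.

```python
from html.parser import HTMLParser  # standard library only, no third-party toolkit

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # elements with no closing tag
SKIP_TEXT_TAGS = {"script", "style"}                      # their text is not page content
CANDIDATE_TAGS = {"div", "td", "article", "section"}      # assumption: likely content containers


class Node:
    """One Element node of the simplified DOM tree."""

    def __init__(self, tag, parent=None, attrs=None):
        self.tag = tag
        self.parent = parent
        self.attrs = dict(attrs or [])
        self.children = []
        self.text = ""  # text that sits directly inside this element
        self.depth = 0 if parent is None else parent.depth + 1


class DomBuilder(HTMLParser):
    """Builds a Node tree from HTML, tolerating unclosed and stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current, attrs)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TEXT_TAGS:
            self.current.text += data


def parse(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node):
    """Pre-order traversal of a sub-tree."""
    yield node
    for child in node.children:
        yield from walk(child)


def candidates(root):
    """Sub-trees that might hold the news body (hypothetical selection rule)."""
    return [n for n in walk(root) if n.tag in CANDIDATE_TAGS]


def features(node):
    """Four illustrative attributes of a candidate sub-tree (assumed, not from the paper)."""
    total = link = paragraphs = 0
    stack = [(node, False)]
    while stack:
        cur, in_link = stack.pop()
        in_link = in_link or cur.tag == "a"
        n = len(" ".join(cur.text.split()))  # whitespace-normalised text length
        total += n
        if in_link:
            link += n
        if cur.tag == "p":
            paragraphs += 1
        stack.extend((child, in_link) for child in cur.children)
    return {
        "text_len": total,
        "link_ratio": link / total if total else 0.0,
        "paragraphs": paragraphs,
        "depth": node.depth,
    }
```

Calling `features(c)` for each `c` in `candidates(parse(html))` yields one attribute vector per candidate sub-tree. Together with a label marking which candidate is the news body, such vectors are the kind of input a decision tree learner could be trained on.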

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Chen, Z., Lv, J.C. (2012). Web News Pages Extraction Method Based on DOM and Decision Tree. In: Qian, Z., Cao, L., Su, W., Wang, T., Yang, H. (eds) Recent Advances in Computer Science and Information Engineering. Lecture Notes in Electrical Engineering, vol 124. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25781-0_23

  • DOI: https://doi.org/10.1007/978-3-642-25781-0_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25780-3

  • Online ISBN: 978-3-642-25781-0

  • eBook Packages: Engineering, Engineering (R0)
