
DOM Semantic Expansion-Based Extraction of Topical Information from Web Pages

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6988)

Abstract

Web pages usually contain a great deal of irrelevant information that users do not need, so effective methods are required to pull the relevant information out of this clutter. Exploiting the semi-structured nature of HTML, theme-relevant information in web pages can be extracted by semantic pruning over a DOM representation of the page, combined with features of the page structure and a fuzzy classification of keywords.
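The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of keyword-driven DOM pruning, not the authors' method. It assumes a hypothetical table of keyword weights in [0, 1] (`KEYWORDS`) as a stand-in for fuzzy membership degrees, uses link density as one possible structural feature, and drops block-level subtrees whose score falls below an arbitrary threshold. All names (`Node`, `TreeBuilder`, `relevance`, `prune`, `extract_topical`) are illustrative.

```python
from html.parser import HTMLParser

# Hypothetical keyword weights in [0, 1], standing in for fuzzy membership degrees.
KEYWORDS = {"dom": 1.0, "extraction": 0.8, "web": 0.5}
# Only these block-level tags are candidates for pruning (an assumption of this sketch).
BLOCK = {"div", "p", "ul", "ol", "li", "table", "tr", "td", "nav", "section"}
VOID = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []  # child Node objects
        self.text = []      # text chunks found directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags.
        n = self.cur
        while n is not None and n.tag != tag:
            n = n.parent
        if n is not None and n.parent is not None:
            self.cur = n.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.text.append(data.strip())


def all_text(node):
    # Simplification: a node's own text is emitted before its children's text.
    parts = list(node.text) + [all_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def link_text_len(node):
    if node.tag == "a":
        return len(all_text(node))
    return sum(link_text_len(c) for c in node.children)


def relevance(node):
    """Mean keyword weight of the subtree text, discounted by link density."""
    text = all_text(node)
    words = text.lower().split()
    if not words:
        return 0.0
    topical = sum(KEYWORDS.get(w.strip(".,;:!?"), 0.0) for w in words) / len(words)
    return topical * (1.0 - link_text_len(node) / len(text))


def prune(node, threshold):
    # Drop block-level children scoring below the threshold, then recurse.
    node.children = [c for c in node.children
                     if c.tag not in BLOCK or relevance(c) >= threshold]
    for c in node.children:
        prune(c, threshold)


def extract_topical(html, threshold=0.1):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, threshold)
    return all_text(builder.root)


page = """<body>
<div><a href="/">Home</a> <a href="/shop">Shop</a></div>
<div><p>DOM extraction keeps topical web content.</p></div>
<div><p>Buy cheap watches today!</p></div>
</body>"""
print(extract_topical(page))  # prints: DOM extraction keeps topical web content.
```

In this toy page the navigation block and the advertisement contain no weighted keywords, so both are pruned and only the middle paragraph survives. A real system would need a learned or curated keyword classification and a more careful treatment of inline markup; the threshold of 0.1 is arbitrary.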




Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chen, J., Jia, J., Duan, L. (2011). DOM Semantic Expansion-Based Extraction of Topical Information from Web Pages. In: Gong, Z., Luo, X., Chen, J., Lei, J., Wang, F.L. (eds) Web Information Systems and Mining. WISM 2011. Lecture Notes in Computer Science, vol 6988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23982-3_42


  • DOI: https://doi.org/10.1007/978-3-642-23982-3_42

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23981-6

  • Online ISBN: 978-3-642-23982-3

  • eBook Packages: Computer Science, Computer Science (R0)
