Extracting the Main Content of Web Documents Based on Character Encoding and a Naive Smoothing Method

  • Conference paper
Software and Data Technologies (ICSOFT 2011)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 303)

Abstract

This chapter presents R2L, DANA and DANAg, a family of novel algorithms for extracting the main content (MC) of web documents. The main concept behind R2L, which also provided the initial idea and motivation for the other two algorithms, is to exploit particularities of Right-to-Left languages to obtain the MC of web pages. Because the English character set and the Right-to-Left character set are encoded in different intervals of the Unicode character set, Right-to-Left characters can be distinguished efficiently from English ones in an HTML file. R2L then extracts the areas of the HTML file that have a high density of Right-to-Left characters and a low density of characters from the English character set. Having recognized these areas, R2L returns only their Right-to-Left characters as the result. The first extension, DANA, improves the effectiveness of the baseline algorithm by employing an HTML parser in a post-processing phase of R2L to extract the MC from areas with a high density of Right-to-Left characters. The second extension, DANAg, generalizes the idea of R2L to make it language independent.
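
The density idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not say how "areas" are delimited or which code point intervals are used, so the sketch assumes that an area is a single line of the HTML source, that the interval U+0590 to U+08FF (Hebrew, Arabic and neighbouring blocks) stands in for the Right-to-Left character set, and that visible ASCII stands in for the English character set. The names is_rtl, is_english, extract_rtl_content and SAMPLE are hypothetical.

```python
def is_rtl(ch: str) -> bool:
    """Assumption: U+0590..U+08FF approximates the Right-to-Left character set."""
    return 0x0590 <= ord(ch) <= 0x08FF


def is_english(ch: str) -> bool:
    """Assumption: visible ASCII approximates the English character set."""
    return 0x21 <= ord(ch) <= 0x7E


def extract_rtl_content(html: str) -> str:
    """Keep lines where Right-to-Left characters outnumber English ones,
    and return only the Right-to-Left characters of those lines."""
    kept = []
    for line in html.splitlines():
        rtl = sum(1 for ch in line if is_rtl(ch))
        eng = sum(1 for ch in line if is_english(ch))
        if rtl > eng:  # assumed criterion for "high RTL density, low English density"
            # Replace every non-RTL character by a space, then collapse the spaces.
            text = "".join(ch if is_rtl(ch) else " " for ch in line)
            kept.append(" ".join(text.split()))
    return "\n".join(kept)


# Hypothetical input: a navigation line, a content paragraph, and a script line.
SAMPLE = "\n".join([
    '<div class="menu"><a href="/home">خانه</a></div>',
    '<p>این یک متن اصلی طولانی برای آزمایش است</p>',
    '<script>var x = 1;</script>',
])

if __name__ == "__main__":
    print(extract_rtl_content(SAMPLE))
```

On the sample, the navigation and script lines are dominated by markup and are dropped, while the paragraph line is kept and stripped of its tags. The per-line granularity and the simple majority criterion are illustrative choices only; the paper's actual segmentation and thresholds may differ.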

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mohammadzadeh, H., Gottron, T., Schweiggert, F., Nakhaeizadeh, G. (2013). Extracting the Main Content of Web Documents Based on Character Encoding and a Naive Smoothing Method. In: Escalona, M.J., Cordeiro, J., Shishkov, B. (eds) Software and Data Technologies. ICSOFT 2011. Communications in Computer and Information Science, vol 303. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36177-7_14

  • DOI: https://doi.org/10.1007/978-3-642-36177-7_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36176-0

  • Online ISBN: 978-3-642-36177-7

  • eBook Packages: Computer Science, Computer Science (R0)
