Extracting the Main Content of Web Documents Based on Character Encoding and a Naive Smoothing Method

Mohammadzadeh, Hadi; Gottron, Thomas; Schweiggert, Franz; Nakhaeizadeh, Gholamreza

doi:10.1007/978-3-642-36177-7_14

Part of the book series: Communications in Computer and Information Science (CCIS, volume 303)

Included in the following conference series:

International Conference on Software and Data Technologies

Abstract

This chapter presents R2L, DANA and DANAg, a family of novel algorithms for extracting the main content (MC) of web documents. The main concept behind R2L, which also provided the initial idea and motivation for the other two algorithms, is to exploit particularities of Right-to-Left languages to obtain the MC of web pages. Because the English character set and the Right-to-Left character set are encoded in different intervals of the Unicode character set, Right-to-Left characters can be distinguished efficiently from English ones in an HTML file. The R2L approach then extracts the areas of the HTML file with a high density of Right-to-Left characters and a low density of characters from the English character set. Having recognized these areas, R2L returns only the Right-to-Left characters they contain as its result. The first extension, DANA, improves the effectiveness of the baseline algorithm by employing an HTML parser in a post-processing phase of R2L to extract the MC from areas with a high density of Right-to-Left characters. DANAg, the second extension, generalizes the idea of R2L to make it language independent.
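
The core idea of R2L, as the abstract describes it, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the code point ranges (the Hebrew and Arabic Unicode blocks), the choice of visible ASCII as the "English character set", the per-line comparison, and all function names are assumptions made for illustration. The abstract speaks of "areas" of the file, and grouping neighbouring lines into such areas is deliberately left out here.

```python
# Illustrative sketch only; ranges, names and the per-line rule are assumptions.

# Assumed Right-to-Left ranges: Hebrew (U+0590-U+05FF) and Arabic (U+0600-U+06FF).
RTL_RANGES = ((0x0590, 0x05FF), (0x0600, 0x06FF))


def is_rtl(ch):
    """True if the character falls in one of the assumed Right-to-Left ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in RTL_RANGES)


def is_ascii_visible(ch):
    """True for printable, non-space ASCII (assumed stand-in for 'English' characters)."""
    return 0x21 <= ord(ch) <= 0x7E


def extract_rtl(html):
    """Keep lines where Right-to-Left characters outnumber visible ASCII characters,
    and return only the Right-to-Left text of those lines."""
    kept = []
    for line in html.splitlines():
        rtl_count = sum(1 for c in line if is_rtl(c))
        ascii_count = sum(1 for c in line if is_ascii_visible(c))
        if rtl_count > ascii_count:
            # Drop markup and other non-RTL characters, keep spaces between words.
            text = "".join(c for c in line if is_rtl(c) or c.isspace())
            text = " ".join(text.split())
            if text:
                kept.append(text)
    return kept


if __name__ == "__main__":
    sample = (
        '<div class="menu"><a href="/home">Home</a></div>\n'
        "<p>سلام دنیا این یک متن است</p>\n"
        '<div id="footer">Copyright 2011</div>'
    )
    for line in extract_rtl(sample):
        print(line)
```

In this sketch the navigation and footer lines are dominated by ASCII markup and are discarded, while the paragraph line is kept and stripped of its tags. A line-by-line rule like this is cruder than a density estimate over regions of the file, and it says nothing about how DANA or DANAg refine the result.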

Author information

Authors and Affiliations

Institute of Applied Information Processing, University of Ulm, D-89069, Ulm, Germany
Hadi Mohammadzadeh & Franz Schweiggert
Institute for Web Science and Technologies, Universität Koblenz-Landau, D-56070, Koblenz, Germany
Thomas Gottron
Institute of Statistics, Econometrics and Mathematical Finance, University of Karlsruhe, D-76128, Karlsruhe, Germany
Gholamreza Nakhaeizadeh

Editor information

Editors and Affiliations

ETS Ingeniería Informática, Universidad de Sevilla, Av. Reina Mercedes S/N, 41012, Sevilla, Spain
María José Escalona
Department of Systems and Informatics, INSTICC / IPS, Rua do Vale de Chaves, 2910-761, Estefanilha, Setúbal, Portugal
José Cordeiro
IICREST, P.O. Box 104, 1618, Sofia, Bulgaria
Boris Shishkov

About this paper

Cite this paper

Mohammadzadeh, H., Gottron, T., Schweiggert, F., Nakhaeizadeh, G. (2013). Extracting the Main Content of Web Documents Based on Character Encoding and a Naive Smoothing Method. In: Escalona, M.J., Cordeiro, J., Shishkov, B. (eds) Software and Data Technologies. ICSOFT 2011. Communications in Computer and Information Science, vol 303. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36177-7_14

DOI: https://doi.org/10.1007/978-3-642-36177-7_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36176-0
Online ISBN: 978-3-642-36177-7
eBook Packages: Computer Science, Computer Science (R0)
