Abstract
This chapter presents R2L, DANA and DANAg, a family of novel algorithms for extracting the main content (MC) of web documents. The main concept behind R2L, which also provided the initial idea and motivation for the other two algorithms, is to exploit particularities of Right-to-Left languages to obtain the MC of web pages. Because the English character set and the Right-to-Left character set are encoded in different intervals of the Unicode character set, Right-to-Left characters can be distinguished efficiently from English ones in an HTML file. The R2L approach then extracts those areas of the HTML file that have a high density of Right-to-Left characters and a low density of characters from the English character set. Having recognized these areas, R2L returns only the Right-to-Left characters they contain as its result. The first extension, DANA, improves the effectiveness of the baseline algorithm by employing an HTML parser in a post-processing phase of R2L to extract the MC from areas with a high density of Right-to-Left characters. DANAg, the second extension, generalizes the idea of R2L to make it language independent.
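The density idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-line granularity, the selected Unicode blocks, the `min_diff` threshold and all function names are assumptions made for illustration only.

```python
# Illustrative sketch only. Line granularity, the chosen Unicode blocks and
# the min_diff threshold are assumptions, not details taken from the chapter.

# Code point intervals of some Right-to-Left scripts (assumed selection).
RTL_RANGES = (
    (0x0590, 0x05FF),  # Hebrew
    (0x0600, 0x06FF),  # Arabic (also covers Persian letters)
    (0x0750, 0x077F),  # Arabic Supplement
)


def is_rtl(ch):
    """Return True if ch lies in one of the assumed Right-to-Left intervals."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in RTL_RANGES)


def is_ascii_visible(ch):
    """Return True for printable, non-space ASCII (typical of HTML markup)."""
    return 0x21 <= ord(ch) <= 0x7E


def line_counts(line):
    """Count Right-to-Left characters and visible ASCII characters in a line."""
    rtl = sum(1 for ch in line if is_rtl(ch))
    ascii_count = sum(1 for ch in line if is_ascii_visible(ch))
    return rtl, ascii_count


def extract_rtl_text(html, min_diff=0):
    """Keep lines where Right-to-Left characters outnumber ASCII characters
    by more than min_diff, and return only their Right-to-Left characters."""
    kept = []
    for line in html.splitlines():
        rtl, ascii_count = line_counts(line)
        if rtl - ascii_count > min_diff:
            # Replace every non-RTL character with a space, then collapse runs
            # of whitespace so that words stay separated.
            text = "".join(ch if is_rtl(ch) else " " for ch in line)
            kept.append(" ".join(text.split()))
    return "\n".join(kept)
```

Under these assumptions, a navigation line that consists mostly of tags and attributes is discarded because its ASCII count dominates, while a paragraph line with long Right-to-Left text is kept and stripped of its markup.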
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mohammadzadeh, H., Gottron, T., Schweiggert, F., Nakhaeizadeh, G. (2013). Extracting the Main Content of Web Documents Based on Character Encoding and a Naive Smoothing Method. In: Escalona, M.J., Cordeiro, J., Shishkov, B. (eds) Software and Data Technologies. ICSOFT 2011. Communications in Computer and Information Science, vol 303. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36177-7_14
DOI: https://doi.org/10.1007/978-3-642-36177-7_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36176-0
Online ISBN: 978-3-642-36177-7
eBook Packages: Computer Science, Computer Science (R0)