Classification of News Web Documents Based on Structural Features

Tongchim, Shisanu; Sornlertlamvanich, Virach; Isahara, Hitoshi

doi:10.1007/11816508_17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4139)

Included in the following conference series:

International Conference on Natural Language Processing (in Finland)

Abstract

The motivation for this work comes from the need for a Thai web corpus to test our information retrieval algorithm. Two collections of news web documents are gathered from two different Thai newspaper web sites. Our goal is to find a simple yet effective method for extracting news articles from these web collections. We explore the use of machine learning methods to distinguish article pages from non-article pages, e.g. tables of contents and advertisements. The selected web articles are then compared in a fine-grained manner in order to find informative structures. Both steps of information extraction use the structural features of web documents rather than extracted keywords or terms. Thus, the inherent errors of word segmentation, one of the major problems in Thai text processing, do not affect this method.
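
As a rough illustration of what "structural features" might look like in practice, the sketch below counts tags and measures how much of a page's visible text sits inside links, without tokenizing any words. The feature names, the `StructureCounter` class, and the `structural_features` function are hypothetical choices made for this example; the abstract does not state which features or which learning algorithms the authors actually used.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Collects tag counts and text lengths; never splits text into words."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.text_chars = 0   # characters of visible text
        self.link_chars = 0   # characters of visible text inside <a> elements
        self._in_link = 0     # nesting depth inside <a>
        self._skip = 0        # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "a":
            self._in_link += 1
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip:
            return
        n = len(data.strip())
        self.text_chars += n
        if self._in_link:
            self.link_chars += n


def structural_features(html):
    """Return a small, illustrative (assumed) feature dictionary for one page."""
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    total = parser.text_chars
    return {
        "n_tags": sum(parser.tags.values()),
        "n_links": parser.tags["a"],
        "n_paragraphs": parser.tags["p"],
        "text_chars": total,
        "link_text_ratio": parser.link_chars / total if total else 0.0,
    }
```

A feature dictionary of this kind could be passed to any standard classifier. Because only tags and character counts are used, the values do not depend on where Thai word boundaries fall, which is the property the abstract emphasizes.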

Author information

Authors and Affiliations

Thai Computational Linguistics Laboratory, National Institute of Information and Communications Technology, 112 Paholyothin Road, Klong 1, Klong Luang, Pathumthani, 12120, Thailand
Shisanu Tongchim, Virach Sornlertlamvanich & Hitoshi Isahara

Editor information

Editors and Affiliations

Turku Centre for Computer Science (TUCS), Department of Information Technology, University of Turku, Joukahaisenkatu 3-5 B, FIN-20520, Turku, Finland
Tapio Salakoski
Turku Centre for Computer Science (TUCS) and Department of IT, University of Turku, Lemminkäisenkatu 14 A, 20520, Turku, Finland
Filip Ginter & Sampo Pyysalo
Department of Information Technology, University of Turku, Lemminkäisenkatu 14–18 A, FIN-20520, Turku, Finland
Tapio Pahikkala

About this paper

Cite this paper

Tongchim, S., Sornlertlamvanich, V., Isahara, H. (2006). Classification of News Web Documents Based on Structural Features. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_17

DOI: https://doi.org/10.1007/11816508_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37334-6
Online ISBN: 978-3-540-37336-0
eBook Packages: Computer Science, Computer Science (R0)
