Classification of News Web Documents Based on Structural Features

  • Conference paper
Advances in Natural Language Processing (FinTAL 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4139)

Abstract

The motivation for this work comes from the need for a Thai web corpus for testing our information retrieval algorithm. Two collections of news web documents are gathered from two different Thai newspaper web sites. Our goal is to find a simple yet effective method to extract news articles from these web collections. We explore the use of machine learning methods to distinguish article pages from non-article pages, e.g. tables of contents and advertisements. The selected web articles are then compared in a fine-grained manner in order to find informative structures. Both steps of information extraction use the structural features of web documents rather than extracted keywords or terms. Thus, the inherent errors of word segmentation, one of the major problems in Thai text processing, do not affect this method.
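
As a rough illustration of what "structural features" could mean in practice, the sketch below counts tags and measures how much of a page's text sits inside links, without ever looking at the words themselves. The abstract does not list the features the authors actually used, so the class `StructureCounter`, the function `structural_features`, and the chosen feature names are all hypothetical; only Python's standard `html.parser` module is assumed.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Hypothetical helper: records tag counts and text lengths only.

    The text is never tokenized, so no word segmentation is required.
    """

    def __init__(self):
        super().__init__()
        self.tags = Counter()   # start-tag name -> number of occurrences
        self.text_chars = 0     # characters of text on the whole page
        self.link_chars = 0     # characters of text inside <a> elements
        self._link_depth = 0    # > 0 while the parser is inside an <a>

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.text_chars += n
        if self._link_depth > 0:
            self.link_chars += n


def structural_features(html):
    """Return an illustrative feature dictionary for one HTML page."""
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    total = parser.text_chars
    return {
        "n_tags": sum(parser.tags.values()),
        "n_links": parser.tags["a"],
        "n_paragraphs": parser.tags["p"],
        "text_chars": total,
        # Share of the page text that is link text; assumed here to be
        # higher on index-like pages than on article pages.
        "link_text_ratio": parser.link_chars / total if total else 0.0,
    }
```

Feature dictionaries of this kind could then be passed to any standard classifier to separate article pages from non-article pages. Which features and which learner work best for the Thai news collections is a question for the paper itself, not something this sketch settles.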

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tongchim, S., Sornlertlamvanich, V., Isahara, H. (2006). Classification of News Web Documents Based on Structural Features. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_17

  • DOI: https://doi.org/10.1007/11816508_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-37334-6

  • Online ISBN: 978-3-540-37336-0

  • eBook Packages: Computer Science, Computer Science (R0)
