A Knowledge-Poor Approach to Turkish Text Categorization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8404)

Abstract

Document categorization is the task of assigning a category to a given document. Supervised methods mostly rely on training data and on rich linguistic resources that are either language-specific or generic. This study proposes a knowledge-poor approach to text categorization that uses no rule sets or language-specific resources such as a part-of-speech tagger or a shallow parser. Knowledge-poor here refers to the lack of a reasonable amount of background knowledge. The proposed system architecture takes the data as-is and simply splits it into tokens on whitespace. Documents represented in the vector space model are used as training data for a range of machine learning algorithms. We empirically examine and compare several factors, from similarity metrics to learning algorithms, in a variety of experimental setups. Although researchers believe that some particular classifiers or metrics are better than others for text categorization, recent studies show that the ranking of models depends on the class, the experimental setup, and the domain. The study features extensive evaluation and comparison across a variety of experiments. We evaluate models and similarity metrics for Turkish, an agglutinative language, specifically within this knowledge-poor framework. The results should be useful for other studies.
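The pipeline outlined in the abstract (whitespace tokenization, a vector space representation, and a similarity metric feeding a learner) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's method: the abstract does not state which term weighting, similarity metrics, or classifiers were used, so the sketch picks raw term frequency, cosine similarity, and a 1-nearest-neighbour rule. The function names and the toy Turkish sentences are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Knowledge-poor tokenization: split on whitespace only.
    # No stemming, lowercasing, or morphological analysis is applied.
    return text.split()


def to_vector(text):
    # Sparse term-frequency vector (token -> count).
    # Raw term frequency is an assumption; the abstract names no weighting.
    return Counter(tokenize(text))


def cosine(u, v):
    # Cosine similarity between two sparse vectors.
    dot = sum(count * v[token] for token, count in u.items() if token in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(text, training):
    # 1-nearest-neighbour: return the label of the most similar training
    # document. This classifier is an illustrative choice only.
    query = to_vector(text)
    best_label, best_score = None, -1.0
    for doc, label in training:
        score = cosine(query, to_vector(doc))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set: (document, category) pairs.
TRAINING = [
    ("takım maçı kazandı", "spor"),
    ("borsa bugün yükseldi", "ekonomi"),
]

print(classify("takım bugün maçı kaybetti", TRAINING))  # prints: spor
```

Because tokens are taken as-is, inflected surface forms such as "maç" and "maçı" count as unrelated features in this sketch. For an agglutinative language this enlarges the vocabulary, which is the trade-off a knowledge-poor setup accepts in exchange for needing no morphological tools.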

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yildirim, S. (2014). A Knowledge-Poor Approach to Turkish Text Categorization. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_36

  • DOI: https://doi.org/10.1007/978-3-642-54903-8_36

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54902-1

  • Online ISBN: 978-3-642-54903-8

  • eBook Packages: Computer Science, Computer Science (R0)
