Automatic Category Theme Identification and Hierarchy Generation for Chinese Text Categorization

  • Hsin-Chang Yang
  • Chung-Hong Lee

Abstract

Research on text mining has recently attracted considerable attention from both industry and academia. Text mining is concerned with discovering unknown patterns or knowledge in a large text repository. The problem is hard to tackle because the texts under consideration are semi-structured or even unstructured. Many approaches have been devised for mining various kinds of knowledge from texts. One important aspect of text mining is automatic text categorization, which assigns a text document to a predefined category if the document falls within the theme of that category. Traditionally, categories are arranged hierarchically to support effective searching and indexing and to make them easy for people to comprehend. Category themes and their hierarchical structures have mostly been determined by human experts. In this work, we developed an approach that automatically generates category themes and reveals the hierarchical structure among them. We also used the generated structure to categorize text documents. A self-organizing map was trained on the document collection to form two feature maps. These maps were then analyzed to obtain the category themes and their structure. Although the test corpus contains documents written in Chinese, the proposed approach can be applied to documents written in any language, provided that such documents can be transformed into a list of separated terms.
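
The step of training a self-organizing map on term lists and reading two maps off it can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the article: the toy vocabulary and documents, the binary term vectors, the 2 x 2 grid, the learning-rate and radius schedules, and in particular the rule used to place each term on a neuron (the neuron with the largest weight for that term). The names `train_som`, `best_matching_unit`, `document_map`, and `word_map` are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x (squared Euclidean)."""
    return min(
        range(len(weights)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(weights[j], x)),
    )


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM and return one weight vector per neuron."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # assumed linear decay
        radius = max(radius0 * (1.0 - frac), 0.5)   # assumed shrinking neighbourhood
        for x in vectors:
            win = best_matching_unit(weights, x)
            wr, wc = divmod(win, cols)
            for j, w in enumerate(weights):
                r, c = divmod(j, cols)
                dist2 = (r - wr) ** 2 + (c - wc) ** 2
                h = math.exp(-dist2 / (2.0 * radius * radius))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


def document_map(weights, vectors):
    """First map: each document is placed on its best-matching neuron."""
    return [best_matching_unit(weights, x) for x in vectors]


def word_map(weights, vocabulary):
    """Second map (assumed rule): each term goes to the neuron that weights it most."""
    return {
        term: max(range(len(weights)), key=lambda j: weights[j][k])
        for k, term in enumerate(vocabulary)
    }


# Toy corpus: each document is already a list of separated terms,
# encoded as a binary vector over the vocabulary.
vocabulary = ["stock", "market", "bank", "football", "goal", "team"]
docs = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
]

weights = train_som(docs, rows=2, cols=2)
doc_map = document_map(weights, docs)
term_map = word_map(weights, vocabulary)
print("document -> neuron:", doc_map)
print("term -> neuron:", term_map)
```

In a sketch like this, terms that land on the same neuron could be read as candidate theme words for the documents on that neuron; how the article actually analyzes the two maps to derive themes and a hierarchy is not stated in the abstract.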

Keywords

automatic category theme identification; automatic category hierarchy generation; text categorization; self-organizing maps; text mining

Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  1. Department of Information Management, Chang Jung University, Tainan, Taiwan
  2. Department of Electrical Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
